
How to avoid a weird umlaut error when using data.table

I need to compute per-ID column sums on a sparse data frame:

require(data.table)
sentEx = structure(list(abend = c(1, 1, 0, 0, 2), aber = c(0, 1, 0, 0, 
0), über = c(1, 0, 0, 0, 0), überall = c(0, 0, 0, 0, 0), überlegt = c(0, 
0, 0, 0, 0), ID = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("0019", 
"0021"), class = "factor"), abgeandert = c(1, 1, 1, 0, 0), abgebildet = c(0, 
0, 1, 1, 0), abgelegt = c(0, 0, 0, 0, 3)), .Names = c("abend", 
"aber", "über", "überall", "überlegt", "ID", "abgeandert", "abgebildet", 
"abgelegt"), row.names = c(1L, 2L, 16L, 17L, 18L), class = "data.frame")

sentEx  # How it looks
   abend aber über überall überlegt   ID abgeandert abgebildet abgelegt
1      1    0    1       0        0 0019          1          0        0
2      1    1    0       0        0 0019          1          0        0
16     0    0    0       0        0 0021          1          1        0
17     0    0    0       0        0 0021          0          1        0
18     2    0    0       0        0 0021          0          0        3

Without the umlaut columns it works fine:

sentEx.dt <- data.table(sentEx[,-c(3,4,5)])[, lapply(.SD, sum), by=ID]
(sentExSum <- as.data.frame(sentEx.dt))  # Need again as dataframe, which looks like:
    ID abend aber abgeandert abgebildet abgelegt
1 0019     2    1          2          0        0
2 0021     2    0          1          2        3 

But with the umlaut columns included, I get this error:

sentEx.dt <- data.table(sentEx)[, lapply(.SD, sum), by=ID]
# Error in gsum(`über`) : object 'über' not found
sentExSum <- as.data.frame(sentEx.dt)

Some additional session info (since the issue seems to be system-related; see comments):

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.0    reshape2_1.2.2 stringr_0.6.2  tools_3.0.2

Output of the other requested commands:

require(data.table); test.data.table()
Running C:/Users/Krohana/Documents/R/win-library/3.0/data.table/tests/tests.Rraw 
Loading required package: reshape
Loading required package: hexbin
Loading required package: xts
Loading required package: bit64
Test 167.2 not run. If required call library(hexbin) first.
Don't know how to automatically pick scale for object of type ITime. Defaulting to continuous
Don't know how to automatically pick scale for object of type ITime. Defaulting to continuous
Tests 487 and 488 not run. If required call library(reshape) first.
Test 841 not run. If required call library(xts) first.
Tests 897-899 not run. If required call library(bit64) first.
All 1220 tests in inst/tests/tests.Rraw completed ok in 24.321sec on Sun Mar 02 17:57:26 2014

Further requested commands:

> Encoding(names(sentEx))
[1] "unknown" "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown" "unknown"
> options(datatable.verbose=TRUE)
> options(datatable.verbose=TRUE); options(datatable.optimize=1L);
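A note on reading the Encoding() output above: R reports pure-ASCII strings as "unknown", so a mix of "unknown" and "UTF-8" is expected here, and only the three umlaut names carry an explicit encoding mark. The sketch below illustrates this on a small hypothetical vector `nms` (a stand-in, not part of the original session), with the umlaut written as the escape `\u00fc` so the snippet does not depend on the file's encoding.

```r
# Hypothetical stand-in for names(sentEx); "\u00fc" is the escape for "ü".
nms <- c("abend", "aber", "\u00fcber", "ID")

# ASCII-only strings are reported as "unknown"; the \u escape marks the
# third element as UTF-8.
enc <- Encoding(nms)

# Flag names that contain any code point outside the ASCII range.
hasNonAscii <- vapply(
  enc2utf8(nms),
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

print(data.frame(name = nms, encoding = enc, nonAscii = hasNonAscii))
```

On this reading, the encoding marks alone do not show that anything is wrong with the column names.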
alex asked Oct 01 '22 10:10



1 Answer

I couldn't reproduce this either. However, Arun found another "object not found" error, and I'm hoping it has the same cause as this one.

Now fixed in v1.9.3, commit 1212. From NEWS:

o An error "object [name] not found" could occur in some circumstances, particularly after a previous error. Reported with non-ASCII characters in a column name, a red herring we hope since non-ASCII characters are supported in column names in data.table. Fix implemented and tests added.

If it happens again, please let us know. Your test has been added verbatim to the test suite, thanks.
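For anyone who cannot upgrade yet, here is a minimal workaround sketch that computes the same per-ID sums in base R, so the data.table grouping code is not involved at all. It is an assumption-laden illustration, not the fix described in NEWS: `valueCols`, `sums` and the rebuilt `sentEx` are names chosen here, and the umlauts are written as `\u00fc` escapes so the script stays ASCII-only.

```r
# Rebuild the example from the question with placeholder names first,
# then set the umlaut names via \u escapes ("\u00fc" is "ü").
sentEx <- data.frame(
  abend      = c(1, 1, 0, 0, 2),
  aber       = c(0, 1, 0, 0, 0),
  u1         = c(1, 0, 0, 0, 0),
  u2         = c(0, 0, 0, 0, 0),
  u3         = c(0, 0, 0, 0, 0),
  ID         = factor(c("0019", "0019", "0021", "0021", "0021")),
  abgeandert = c(1, 1, 1, 0, 0),
  abgebildet = c(0, 0, 1, 1, 0),
  abgelegt   = c(0, 0, 0, 0, 3)
)
names(sentEx)[3:5] <- c("\u00fcber", "\u00fcberall", "\u00fcberlegt")

# Sum every non-ID column within each ID using base R's rowsum().
valueCols <- setdiff(names(sentEx), "ID")
sums <- rowsum(sentEx[valueCols], group = sentEx$ID)

# rowsum() keeps the groups as row names; move them back into an ID column.
# check.names = FALSE keeps the non-ASCII column names untouched.
sentExSum <- data.frame(ID = rownames(sums), sums,
                        check.names = FALSE, row.names = NULL)
print(sentExSum)
```

The result is an ordinary data frame with the ID column first, which is what the question needed in the end anyway. The column order of the value columns follows the input rather than the data.table output shown in the question.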

Matt Dowle answered Oct 05 '22 11:10
