data.table and "by must evaluate to list" Error

Tags: r, data.table

I would like to use the data.table package in R to dynamically generate aggregations, but I am running into an error. Below, let my.dt be of type data.table.

library(data.table)

sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
grouping.vars <- c("sex", "age")
for (i in 1:2) {
     my.dt[,sum(dependent.variable), by=grouping.vars[i]]
}

If I run this, I get the following error:

Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
  by must evaluate to list

Yet the following works without error:

my.dt[,sum(dependent.variable), by=sex]

I see why the error is occurring, but I do not see how to use a vector with the by parameter.

Ryan R. Rosario asked Jul 15 '10 02:07



2 Answers

[UPDATE] Two years after the question was asked ...

When the code in the question is run, data.table is now more helpful and returns this (using 1.8.2):

Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) : 
  'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...)
  if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency
  so data.table can detect which columns are needed.

Following the advice at the end of the error message (by=eval(...)):

my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])] 
   sex   V1
1:   M 2650
2:   F 2600
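
The same idea can be applied to the whole loop from the question. The following is only a sketch, assuming a data.table version in which by=eval(...) behaves as the error message describes; the results list is a name introduced here for illustration. Note that nothing is auto-printed inside a for loop, so the results are stored and printed explicitly.

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)
grouping.vars <- c("sex", "age")

# One aggregation per grouping column, stored under the column's name.
# `results` is a hypothetical helper list, not part of the original question.
results <- list()
for (i in seq_along(grouping.vars)) {
  # by=eval(...) is the form suggested by the error message.
  results[[grouping.vars[i]]] <-
    my.dt[, sum(dependent.variable), by = eval(grouping.vars[i])]
}

# Expressions inside a for loop are not auto-printed, so print explicitly.
print(results$sex)
print(results$age)
```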



Old answer from Jul 2010 (note that by can now also be double or character):

Strictly speaking, by needs to evaluate to a list of vectors, each with storage mode integer. The numeric vector age could therefore also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast), and the radix algorithm is specifically for integers (see Wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is, of course, an integer lookup to unique strings.
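
As a small sketch of the coercion mentioned above, using the data from the question: the age column is converted in place with as.integer() before grouping. Per the update note, newer versions also accept double and character columns in by, so this step is optional there.

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)

# Convert the double column to integer storage by reference.
my.dt[, age := as.integer(age)]

# Group on the integer column.
res <- my.dt[, sum(dependent.variable), by = age]
print(res)
```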

The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month, for example:

DT[,sum(col1), by=list(region,month(datecol))]

Or, a very fast way to group by year and month is to store dates in a non-epoch-based form, such as yyyymmddL integers as seen in some of the examples in the package, like this:

DT[,sum(col1), by=list(region,month=datecol%/%100L)]

Notice how you can name the columns inside the list() like that.
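
To make the yyyymmdd example concrete, here is a minimal sketch on made-up data. The table contents are hypothetical; only the column names region, datecol and col1 come from the snippets above, and total is a name chosen here for the aggregated column.

```r
library(data.table)

# Hypothetical data: datecol holds dates as yyyymmdd integers.
DT <- data.table(
  region  = c("N", "N", "S", "S"),
  datecol = c(20100105L, 20100120L, 20100210L, 20100215L),
  col1    = c(1, 2, 3, 4)
)

# Integer division by 100 drops the day, leaving yyyymm.
# The grouping expression is named `month` inside list().
res <- DT[, list(total = sum(col1)),
          by = list(region, month = datecol %/% 100L)]
print(res)
```

The result has one row per region and yyyymm combination, with the grouping column carrying the name given inside list().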

To define and reuse complex grouping expressions:

e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
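
A runnable sketch of the quote()/eval() pattern, again on hypothetical data with yyyymmdd integer dates (the values, and the result names totals and averages, are assumptions for illustration). The quoted expression is built once and passed to several aggregations.

```r
library(data.table)

# Hypothetical data: datecol holds dates as yyyymmdd integers.
DT <- data.table(
  region  = c("N", "N", "S", "S"),
  datecol = c(20100105L, 20100120L, 20100210L, 20100215L),
  col1    = c(1, 2, 3, 4)
)

# Store the grouping expression unevaluated ...
e <- quote(list(region, month = datecol %/% 100L))

# ... and reuse it in more than one aggregation.
totals   <- DT[, list(total = sum(col1)), by = eval(e)]
averages <- DT[, list(avg = mean(col1)), by = eval(e)]

print(totals)
print(averages)
```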

If the by expressions take a long time to calculate or allocate, or you need to reuse them many times, you can avoid re-evaluating them each time by saving the result once and reusing it:

byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]

Please see http://datatable.r-forge.r-project.org/ for the latest info and status. A new presentation will be up there soon, and I hope to release v1.5 to CRAN soon too. It contains several bug fixes and new features, detailed in the NEWS file. The datatable-help list has about 30-40 posts a month, which may be of interest too.

Matt Dowle answered Oct 18 '22 13:10



I made two changes to your original code:

sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21) 

age<-as.factor(age)

dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)

for (a in 1:2) {
  print(my.dt[, sum(dependent.variable), by = list(sex, age)[a]])
}

The numeric vector age should be converted to a factor. As for the by parameter, do not quote the column names; group them in list(...) instead. At least, this is what the package author has suggested.

Vulpecula answered Oct 18 '22 14:10
