I would like to use the data.table package in R to dynamically generate aggregations, but I am running into an error. Below, let my.dt be of type data.table.
library(data.table)
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
grouping.vars <- c("sex", "age")
for (i in 1:2) {
  # this is the call that raises the error shown below
  my.dt[,sum(dependent.variable), by=grouping.vars[i]]
}
If I run this, I get errors:
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
by must evaluate to list
Yet the following works without error:
my.dt[,sum(dependent.variable), by=sex]
I see why the error is occurring, but I do not see how to use a vector with the by parameter.
[UPDATE] 2 years after question was asked ...
On running the code in the question, data.table is now more helpful and returns this (using 1.8.2):
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...)
if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency
so data.table can detect which columns are needed.
Following the advice at the end of that error message (use eval):
my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])]
sex V1
1: M 2650
2: F 2600
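To tie this back to the loop in the question, here is a minimal sketch that applies the eval() form once per grouping variable. It assumes the same my.dt as in the question; the results list and the loop variable g are names introduced here for illustration only. Note that print() is needed inside a for loop, since results are not auto-printed there.

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)
grouping.vars <- c("sex", "age")

# 'results' and 'g' are illustrative names, not part of the original question.
results <- list()
for (g in grouping.vars) {
  # eval(g) hands data.table the column name held in g
  results[[g]] <- my.dt[, sum(dependent.variable), by = eval(g)]
  print(results[[g]])
}
```

Each element of results is then a small data.table with the grouping column and the sum in V1.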
Old answer from Jul 2010 (by can now be double and character, though):
Strictly speaking, by needs to evaluate to a list of vectors, each with storage mode integer. So the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast), but the radix algorithm is specifically for integers only (see Wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is of course an integer lookup to unique strings.
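As a rough sketch of the coercion described above, assuming the my.dt from the question (by.age is a name made up here), the age column can be converted in place and then used in by. As the note above says, later versions accept double and character in by, so this step is only needed where integer storage is wanted.

```r
library(data.table)

my.dt <- data.table(
  sex = c("M", "F", "M", "F"),
  age = c(19, 23, 26, 21),
  dependent.variable = c(1400, 1500, 1250, 1100)
)

# c(19, 23, ...) is stored as double; replace the column with an integer copy.
my.dt[, age := as.integer(age)]
print(storage.mode(my.dt$age))

# Group on the integer column ('by.age' is an illustrative name).
by.age <- my.dt[, sum(dependent.variable), by = age]
print(by.age)
```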
The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month; for example:
DT[,sum(col1), by=list(region,month(datecol))]
Or a very fast way to group by year-month is to store dates in a non-epoch-based integer form, such as yyyymmddL as seen in some of the examples in the package, like this:
DT[,sum(col1), by=list(region,month=datecol%/%100L)]
Notice how you can name the columns inside the list() like that.
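For something concrete to run, here is a small sketch of the yyyymmdd idea. The table DT and its values are made up for illustration; only the column names (region, datecol, col1) follow the snippets above.

```r
library(data.table)

# Hypothetical data: datecol holds integer dates in yyyymmdd form.
DT <- data.table(
  region  = c("N", "N", "N", "S"),
  datecol = c(20100105L, 20100120L, 20100203L, 20100210L),
  col1    = c(10, 20, 30, 40)
)

# Integer division by 100 drops the day, leaving yyyymm.
# The name given inside list() becomes the column name in the result.
res <- DT[, sum(col1), by = list(region, month = datecol %/% 100L)]
print(res)
```

The result has one row per (region, yyyymm) pair, with the grouping columns named region and month.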
To define and reuse complex grouping expressions :
e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
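Here is the same pattern as a self-contained sketch. The data are invented for illustration, and datecol is assumed to be a Date column so that data.table's month() applies; res1 and res2 are names introduced here.

```r
library(data.table)

# Hypothetical data with a Date column.
DT <- data.table(
  region  = c("N", "N", "N", "S"),
  datecol = as.Date(c("2010-01-05", "2010-01-20", "2010-02-03", "2010-02-10")),
  col1    = c(10, 20, 30, 40),
  col2    = c(1, 2, 3, 4),
  col3    = c(4, 4, 4, 4),
  col4    = c(2, 2, 2, 2)
)

# Store the grouping expression once, unevaluated ...
e <- quote(list(region, month(datecol)))

# ... and reuse it in several aggregations.
res1 <- DT[, sum(col1), by = eval(e)]
res2 <- DT[, sum(col2 * col3 / col4), by = eval(e)]
print(res1)
print(res2)
```

Keeping the expression in one place means a change to the grouping only has to be made once.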
Or, if you don't want to re-evaluate the by expressions each time, you can save the result once and reuse it. This helps when the by expressions themselves take a long time to calculate/allocate, or when you need to reuse them many times:
byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]
Please see http://datatable.r-forge.r-project.org/ for latest info and status. A new presentation will be up there soon and hoping to release v1.5 to CRAN soon too. This contains several bug fixes and new features detailed in the NEWS file. The datatable-help list has about 30-40 posts a month which may be of interest too.
I made two changes to your original code:
library(data.table)
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
age <- as.factor(age)  # change 1: age as a factor
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
for (a in 1:2) {
  # change 2: pick the a-th grouping column out of a list of columns
  print(my.dt[,sum(dependent.variable), by=list(sex,age)[a]])
}
The numeric vector age should be converted to a factor. As for the by parameter, do not quote the column names; group them into list(...) instead. At least, this is what the author has suggested.