I'm doing some aggregations on a data.table (excellent package!!!) and I found the .SD variable very useful for many things. However, using it slows down computation significantly when there are many groups. Here's an example:
# A moderately big data.table
library(data.table)
x = data.table(id   = sample(1e4, 1e5, replace = TRUE),
               code = factor(sample(2, 1e5, replace = TRUE)),
               z    = runif(1e5))
setkey(x, id, code)
system.time(x[,list(code2=nrow(.SD[code==2]), total=.N), by=id])
## user system elapsed
## 6.226 0.000 6.242
system.time(x[,list(code2=sum(code==2), total=.N), by=id])
## user system elapsed
## 0.497 0.000 0.498
system.time(x[,list(code2=.SD[code==2,.N], total=.N), by=id])
## user system elapsed
## 6.152 0.000 6.168
Am I doing something wrong? Should I avoid .SD in favor of individual columns? Thanks in advance.
"Am I doing something wrong, i.e. should I avoid .SD in favor of individual columns?"
Yes, exactly. Only use .SD if you really are using all the data inside .SD. You might also find that the call to nrow() and the subquery call to [.data.table inside j are culprits too: use Rprof to confirm.
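For instance, a minimal profiling sketch with base R's Rprof (the output filename here is arbitrary):

Rprof("sd_profile.out")                  # start profiling to a file
x[, list(code2=nrow(.SD[code==2]), total=.N), by=id]
Rprof(NULL)                              # stop profiling
summaryRprof("sd_profile.out")$by.self   # time per function, e.g. in [.data.table and nrow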
See the last few sentences of FAQ 2.1:
FAQ 2.1 How can I avoid writing a really long j expression? You've said I should use the column names, but I've got a lot of columns.
When grouping, the j expression can use column names as variables, as you know, but it can also use a reserved symbol .SD which refers to the Subset of the Data.table for each group (excluding the grouping columns). So to sum up all your columns it's just DT[,lapply(.SD,sum),by=grp]. It might seem tricky, but it's fast to write and fast to run. Notice you don't have to create an anonymous function. See the timing vignette and wiki for comparison to other methods. The .SD object is efficiently implemented internally and more efficient than passing an argument to a function. Please don't do this though: DT[,sum(.SD[,"sales",with=FALSE]),by=grp]. That works but is very inefficient and inelegant. This is what was intended: DT[,sum(sales),by=grp], which could be 100's of times faster.
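As a quick illustration of that idiom on a toy table (not the question's data):

DT <- data.table(grp=rep(1:3, each=2), a=1:6, b=6:1)
DT[, lapply(.SD, sum), by=grp]   # sums every non-grouping column at once
##    grp  a  b
## 1:   1  3 11
## 2:   2  7  7
## 3:   3 11  3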
Also see the first bullet of FAQ 3.1:
FAQ 3.1 I have 20 columns and a large number of rows. Why is an expression of one column so quick?
Several reasons:
-- Only that column is grouped, the other 19 are ignored because data.table inspects the j expression and realises it doesn't use the other columns.
When data.table inspects j and sees the .SD symbol, that efficiency gain goes out the window. It will have to populate the whole of the .SD subset for each group even if you don't use all its columns. It's very difficult for data.table to know which columns of .SD you are really using (j could contain ifs, for example). However, if you need them all anyway, it doesn't matter of course, such as in DT[,lapply(.SD,sum),by=...]. That's ideal use of .SD.
So, yes, avoid .SD wherever possible. Use column names directly to give data.table's optimization of j the best chance. The mere existence of the symbol .SD in j is important.
This is why .SDcols was introduced, so you can tell data.table which columns should be in .SD if you only want a subset. Otherwise, data.table will populate .SD with all the columns just in case j needs them.
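For example, with the question's x, if j only needs the z column, a sketch would be:

x[, lapply(.SD, mean), by=id, .SDcols="z"]   # .SD holds only z, not code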
Try solving this by breaking the calculations into two steps, then joining the resulting data.tables:
system.time({
    x2 <- x[code == 2, list(code2 = .N), by = id]
    xt <- x[, list(total = .N), by = id]
    print(x2[xt])
})
On my machine it runs in 0.04 seconds as opposed to 7.42 seconds, i.e. ~200 times faster than your original code:
          id code2 total
    1:     1     6    14
    2:     2     8    10
    3:     3     7    13
    4:     4     5    13
    5:     5     9    18
   ---
 9995:  9996     4     9
 9996:  9997     3     6
 9997:  9998     6    10
 9998:  9999     3     4
 9999: 10000     3     6
user system elapsed
0.05 0.00 0.04
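One caveat worth noting (an edge case, and an assumption about your real data rather than this sample): an id that never has code==2 appears in xt but not in x2, so the join leaves its code2 as NA. If you'd rather report zeros:

res <- x2[xt]
res[is.na(code2), code2 := 0L]   # count of zero instead of NA for such groups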