 

R data.table slow aggregation when using .SD

Tags:

r

data.table

I'm doing some aggregations on a data.table (excellent package!) and I've found the .SD variable very useful for many things. However, using it slows down computation significantly when there are many groups. Here is an example:

library(data.table)

# A moderately big data.table
x = data.table(id   = sample(1e4, 1e5, replace=TRUE),
               code = factor(sample(2, 1e5, replace=TRUE)),
               z    = runif(1e5)
              )

setkey(x,id,code)

system.time(x[,list(code2=nrow(.SD[code==2]), total=.N), by=id])
##  user  system elapsed 
##  6.226   0.000   6.242

system.time(x[,list(code2=sum(code==2), total=.N), by=id])
## user  system elapsed 
## 0.497   0.000   0.498

system.time(x[,list(code2=.SD[code==2,.N], total=.N), by=id])
## user  system elapsed 
## 6.152   0.000   6.168

Am I doing something wrong? Should I avoid .SD in favor of individual columns? Thanks in advance.

asked Mar 07 '13 by vsalmendra


2 Answers

Am I doing something wrong, i.e. should I avoid .SD in favor of individual columns?

Yes, exactly. Only use .SD if you really are using all the data inside .SD. You might also find that the call to nrow() and the subquery call to [.data.table inside j are culprits too: use Rprof to confirm.
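
For instance, a minimal profiling run might look like this (a sketch; the output file name sd_prof.out is arbitrary):

Rprof("sd_prof.out")
x[, list(code2=nrow(.SD[code==2]), total=.N), by=id]
Rprof(NULL)
summaryRprof("sd_prof.out")$by.self   # expect nrow and [.data.table near the top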

See the last few sentences of FAQ 2.1:

FAQ 2.1 How can I avoid writing a really long j expression? You've said I should use the column names, but I've got a lot of columns.
When grouping, the j expression can use column names as variables, as you know, but it can also use a reserved symbol .SD which refers to the Subset of the Data.table for each group (excluding the grouping columns). So to sum up all your columns it's just DT[,lapply(.SD,sum),by=grp]. It might seem tricky, but it's fast to write and fast to run. Notice you don't have to create an anonymous function. See the timing vignette and wiki for comparison to other methods. The .SD object is efficiently implemented internally and more efficient than passing an argument to a function. Please don't do this though: DT[,sum(.SD[,"sales",with=FALSE]),by=grp]. That works but is very inefficient and inelegant. This is what was intended: DT[,sum(sales),by=grp], which could be hundreds of times faster.
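
For example, a small sketch of the two styles the FAQ contrasts (the table DT and its columns grp, sales and cost are made up for illustration):

library(data.table)
DT = data.table(grp   = sample(letters[1:5], 1e5, replace=TRUE),
                sales = runif(1e5),
                cost  = runif(1e5))

DT[, lapply(.SD, sum), by=grp]   # fine: every column of .SD is genuinely used
DT[, sum(sales), by=grp]         # fine: one column is needed, so name it directly
# DT[, sum(.SD[, "sales", with=FALSE]), by=grp]   # the slow, inelegant form the FAQ warns against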

Also see the first bullet of FAQ 3.1:

FAQ 3.1 I have 20 columns and a large number of rows. Why is an expression of one column so quick?
Several reasons:
-- Only that column is grouped, the other 19 are ignored because data.table inspects the j expression and realises it doesn't use the other columns.

When data.table inspects j and sees the .SD symbol, that efficiency gain goes out the window. It will have to populate the whole of the .SD subset for each group even if you don't use all its columns. It's very difficult for data.table to know which columns of .SD you are really using (j could contain ifs, for example). However, if you need them all anyway, it doesn't matter of course, such as in DT[,lapply(.SD,sum),by=...]. That's ideal use of .SD.

So, yes, avoid .SD wherever possible. Use column names directly to give data.table's optimization of j the best chance; the mere presence of the symbol .SD anywhere in j is enough to switch that optimization off.

This is why .SDcols was introduced: it lets you tell data.table which columns should be in .SD when you only need a subset of them. Otherwise, data.table populates .SD with all the columns just in case j needs them.
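
A minimal sketch using the question's table x, assuming only z is actually needed per group:

x[, lapply(.SD, mean), by=id, .SDcols="z"]   # .SD now contains only z, not code and z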

answered by Matt Dowle


Try solving this by breaking the calculation into two steps and then joining the resulting data.tables:

system.time({
  x2 <- x[code==2, list(code2=.N), by=id]   # rows with code==2, counted per id
  xt <- x[, list(total=.N), by=id]          # all rows, counted per id
  print(x2[xt])                             # join the two results on the key (id)
})

On my machine it runs in 0.04 seconds as opposed to 7.42 seconds, i.e. ~200 times faster than your original code:

         id code2 total
   1:     1     6    14
   2:     2     8    10
   3:     3     7    13
   4:     4     5    13
   5:     5     9    18
  ---                  
9995:  9996     4     9
9996:  9997     3     6
9997:  9998     6    10
9998:  9999     3     4
9999: 10000     3     6
   user  system elapsed 
   0.05    0.00    0.04 
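
One caveat with this two-step approach: an id that happens to have no code==2 rows comes through the join with NA rather than 0 for code2. If you want a zero count there instead, a small follow-up sketch (res is just an arbitrary name):

res = x2[xt]
res[is.na(code2), code2 := 0L]   # ids with no code==2 rows get 0 instead of NA
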
answered by Andrie