R: data.table vs merge(aggregate()) performance

Tags: r, data.table

Or, to put it more generally: DT[, .SD[...], by = ...] versus merge(aggregate(...)).

Without further ado, here's the data and an example:

library(data.table)  # needed for data.table() and the DT[i, j, by] syntax

set.seed(5141)
size <- 1e6
df <- data.table(a = rnorm(size),
                 b = paste0(sample(letters, size, TRUE), 
                            sample(letters, size, TRUE), 
                            sample(letters, size, TRUE)),
                 c = sample(1:(size/10), size, TRUE),
                 d = sample(seq.Date(as.Date("2015-01-01"), 
                                     as.Date("2015-05-31"), by = "day"), size, TRUE))

# .SD approach: within each group c, keep the rows with the latest date
system.time(df[, .SD[d == max(d)], by = c])
# user  system elapsed 
# 50.89    0.00   51.00 
# base-R approach: aggregate the max date per group, then merge back
system.time(merge(aggregate(d ~ c, data = df, max), df))
# user  system elapsed 
# 18.24    0.20   18.45 

I usually have no performance problems with data.table, so this particular example surprised me. I had to subset (aggregate) a fairly large data frame by keeping only the latest occurrences (ties are possible) of some event types, while retaining the rest of the relevant data for those events. However, .SD doesn't seem to scale well in this particular application.

Is there a better "data table way" to tackle this kind of task?

asked Feb 05 '16 by statespace

1 Answer

We can use .I to return the row indices of the matching rows within each group, and then subset the whole table on those indices in one step. It should be faster.

# get per-group row indices of the max date, then subset df once
system.time(df[df[, .I[d == max(d)], by = c]$V1])
#    user  system elapsed 
#   5.00    0.09    5.30 
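
To make the intermediate step concrete, here is what the inner .I call returns on a toy table (a minimal sketch; dt and its values are invented for illustration):

# toy data: two groups, with the max date in rows 2 and 3
dt <- data.table(c = c(1, 1, 2),
                 d = as.Date(c("2015-01-01", "2015-01-02", "2015-01-02")))
dt[, .I[d == max(d)], by = c]
#    c V1
# 1: 1  2
# 2: 2  3

The index column is unnamed, so data.table auto-names it V1, which is why the outer subset uses $V1.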

@Heroka's solution

system.time(df[, is_max := d == max(d), by = c][is_max == TRUE])
#   user  system elapsed 
#  5.06    0.00    5.12 
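
Note that := above adds the is_max helper column to df by reference, so it persists after the timing. If you keep benchmarking on the same df, drop it first (a small cleanup sketch):

df[, is_max := NULL]  # remove the helper column, again by reference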

The aggregate method on my machine gives

system.time(merge(aggregate(d ~ c, data = df, max), df))
#   user  system elapsed 
#  48.62    1.00   50.76 

With the .SD option

system.time(df[,.SD[d == max(d)], by = c])
#   user  system elapsed 
# 151.13    0.40  156.57 

Using the data.table join

system.time(df[df[, .(d = max(d)), by = c], on = c("c", "d")])
#   user  system elapsed 
#   0.58    0.01    0.60 

Comparing merge/aggregate with the == approaches is not quite apples to apples, because they go through different functions. Usually the aggregate/merge method is slower than the corresponding data.table join. But here we are instead using ==, which compares every row (and that takes some time), together with .SD for subsetting (which is also less efficient than .I for row indexing). On top of that, .SD carries the overhead of [.data.table on every group.
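
For completeness, a keyed variant of the same join (a sketch of my own, not part of the timings above): setkey() sorts df by the join columns once, after which the lookup can use binary search instead of a vector scan, and on= can be omitted.

# sort df by (c, d) once; note setkey() reorders df in place
setkey(df, c, d)
# join the per-group maxima back onto df; the key columns are used implicitly
res <- df[df[, .(d = max(d)), by = c]]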

answered Sep 26 '22 by akrun