Or to be more general, it's DT[, .SD[...], by = ...] versus merge(aggregate(...)).
Without further ado, here's data and example:
library(data.table)

set.seed(5141)
size = 1e6
df <- data.table(a = rnorm(size),
                 b = paste0(sample(letters, size, T),
                            sample(letters, size, T),
                            sample(letters, size, T)),
                 c = sample(1:(size/10), size, T),
                 d = sample(seq.Date(as.Date("2015-01-01"),
                                     as.Date("2015-05-31"), by = "day"),
                            size, T))
system.time(df[,.SD[d == max(d)], by = c])
# user system elapsed
# 50.89 0.00 51.00
system.time(merge(aggregate(d ~ c, data = df, max), df))
# user system elapsed
# 18.24 0.20 18.45
I usually have no problems with data.table performance, but this particular example surprised me. I had to subset (aggregate) a fairly large data frame by taking only the latest (possibly simultaneous) occurrences of some event types, while keeping the rest of the relevant data for those particular events. However, it seems that .SD doesn't scale well in this particular application.
Is there a better "data.table way" to tackle this kind of task?
We can use .I to get the row index and subset the rows based on that. It should be faster.
system.time(df[df[,.I[d == max(d)], by = c]$V1])
# user system elapsed
# 5.00 0.09 5.30
@Heroka's solution
system.time(df[,is_max:=d==max(d), by = c][is_max==T,])
# user system elapsed
# 5.06 0.00 5.12
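One thing worth noting about this approach: := adds the is_max helper column to df by reference, so it stays in df and will appear in subsequent results unless it is removed.
df[, is_max := NULL]  # drop the helper column from df by reference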
The aggregate method on my machine gives:
system.time(merge(aggregate(d ~ c, data = df, max), df))
# user system elapsed
# 48.62 1.00 50.76
With the .SD option:
system.time(df[,.SD[d == max(d)], by = c])
# user system elapsed
# 151.13 0.40 156.57
Using a data.table join:
system.time(df[df[, list(d=max(d)) , c], on=c('c', 'd')])
# user system elapsed
# 0.58 0.01 0.60
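As a quick sanity check (a sketch reusing the objects built above; res_i and res_join are just names for illustration), the join keeps ties on the maximum date exactly as the .I subset does, so both return the same set of rows:
res_i    <- df[df[, .I[d == max(d)], by = c]$V1]
res_join <- df[df[, list(d = max(d)), by = c], on = c("c", "d")]
nrow(res_i) == nrow(res_join)  # same number of rows
fsetequal(res_i, res_join)     # same rows regardless of order (recent data.table versions)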
If we compare the merge/aggregate approach with the == approach, they use different functions. Usually, the aggregate/merge method is slower than the corresponding data.table join. But here, instead of a join, we are using ==, which compares every row (and takes some time), along with .SD for subsetting (which is also less efficient than .I for row indexing). On top of that, .SD carries the overhead of [.data.table for every group.
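For completeness, a minimal sketch of the same idea with a keyed table (assuming df as built above; res is a name introduced here, and note that setkey physically reorders df): keying df on (c, d) turns the same self-join into a keyed binary-search join, so the on= argument can be dropped.
setkey(df, c, d)                            # sort df by (c, d) and mark the key
res <- df[df[, list(d = max(d)), by = c]]   # i's columns are matched to df's key (c, d)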