Why is "by" on a vector not from a data.table column very slow?

Tags: r, data.table

library(data.table)
test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x      # plain vector with length(y) == nrow(test), not a column of test
test[,.N, by=x] # fast
test[,.N, by=y] # extremely slow

Why is it slow in the second case?

It is even faster to do this:

test[,y:=y]
test[,.N, by=y]
test[,y:=NULL]

It looks as if grouping by a vector that is not a column is poorly optimized.
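
The temporary-column workaround above modifies test by reference, so the column has to be removed again afterwards. A minimal sketch of the same idea wrapped in a function that leaves the original table untouched is shown here. The helper name count_by_external, the column name "group", and the set.seed() call are assumptions made for illustration; none of them come from data.table or from the original post. The deep copy costs extra memory, so this is only a sketch, not a recommended implementation.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the example is reproducible
test <- data.table(x = sample.int(10, 1000000, replace = TRUE))
y <- test$x

# Hypothetical helper (not part of data.table): count rows per value of an
# external vector `g` by first attaching it to a copy of `dt` as a column.
count_by_external <- function(dt, g) {
  stopifnot(length(g) == nrow(dt))       # one group value per row
  tmp <- copy(dt)                        # deep copy, so `dt` is not modified
  set(tmp, j = "group", value = g)       # add the vector as a real column
  tmp[, .N, by = group]                  # group by a column (the fast path)
}

res <- count_by_external(test, y)
names(test)  # still only "x"; the original table is unchanged
```

Because the grouping happens on a genuine column of the copy, this follows the same path as test[,.N, by=x], at the price of duplicating the table.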

asked Nov 14 '13 at 16:11 by colinfang


1 Answer

Seems like I forgot to update this post.

This was fixed long ago, in commit #1039 of v1.8.11. From NEWS:

Fixed #5106 where DT[, .N, by=y] where y is a vector with length(y) = nrow(DT), but y is not a column in DT. Thanks to colinfang for reporting.

Testing on v1.8.11 commit 1187:

require(data.table)
test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x

system.time(ans1 <- test[,.N, by=x])
#   user  system elapsed 
#  0.015   0.000   0.016 

system.time(ans2 <- test[,.N, by=y])
#   user  system elapsed 
#  0.015   0.000   0.015 

setnames(ans2, "y", "x")
identical(ans1, ans2) # [1] TRUE

answered Nov 04 '22 at 09:11 by Arun