I am trying to find the quickest way to subset a large dataset by several numeric columns. As promised by data.table, the time taken to do binary search is much quicker than for vector scanning. Binary search, however, requires setkey to be performed beforehand. As you see in this code, it takes an exceptionally long time! Once you take that time into account, vector scanning is much much faster:
set.seed(1)
n=10^7
nums <- round(runif(n,0,10000))
DT = data.table(s=sample(nums,n), exp=sample(nums,n),
init=sample(nums,n), contval=sample(nums,n))
this_s = DT[0.5*n,s]
this_exp = DT[0.5*n,exp]
this_init = DT[0.5*n,init]
system.time(ans1<-DT[s==this_s&exp==this_exp&init==this_init,4,with=FALSE])
# user system elapsed
# 0.65 0.01 0.67
system.time(setkey(DT,s,exp,init))
# user system elapsed
# 41.56 0.03 41.59
system.time(ans2<-DT[J(this_s,this_exp,this_init),4,with=FALSE])
# user system elapsed
# 0 0 0
identical(ans1,ans2)
# [1] TRUE
Am I doing something wrong? I've read through the data.table FAQs etc. Any help would be greatly appreciated.
Many thanks.
The line :
nums <- round(runif(n,0,10000))
leaves nums
as type numeric
not integer
. That makes a big difference. The data.table FAQs and introduction are geared towards integer
and character
columns; you won't see setkey
as slow on those types. For example :
nums <- as.integer(round(runif(n,0,10000)))
...
setkey(DT,s,exp,init) # much faster now
Two further points though ...
First, the ordering/sorting operations are much faster in the current development version of data.table v1.8.11. @jihoward is right on about sorting on numeric columns being much more time-consuming operation. But, still it's about 5-8x faster in 1.8.11 (because of a 6-pass radix order implementation, check this post). Comparing the time taken for the setkey
operation between 1.8.10 and 1.8.11:
# v 1.8.11
system.time(setkey(DT,s,exp,init))
# user system elapsed
# 8.358 0.375 8.844
# v 1.8.10
system.time(setkey(DT,s,exp,init))
# user system elapsed
# 66.609 0.489 75.216
It's a 8.5x improvement on my system. So, my guess is this'd take about 4.9 seconds on yours.
Second, as @Roland mentions, if your objective is to perform a couple of subsetting and that is ALL you're going to do, then of course it doesn't make sense to do a setkey as, it has to find the order of columns and then reorder the entire data.table (by reference so that the memory footprint is very minimal, check this post for more on setkey).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With