Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.table: vector scan v binary search with numeric columns - super-slow setkey

Tags:

r

data.table

I am trying to find the quickest way to subset a large dataset by several numeric columns. As promised by data.table, the time taken to do binary search is much quicker than for vector scanning. Binary search, however, requires setkey to be performed beforehand. As you see in this code, it takes an exceptionally long time! Once you take that time into account, vector scanning is much much faster:

set.seed(1)
n=10^7
nums <- round(runif(n,0,10000))
DT = data.table(s=sample(nums,n), exp=sample(nums,n), 
         init=sample(nums,n), contval=sample(nums,n))
this_s = DT[0.5*n,s] 
this_exp = DT[0.5*n,exp]
this_init = DT[0.5*n,init]
system.time(ans1<-DT[s==this_s&exp==this_exp&init==this_init,4,with=FALSE])
#   user  system elapsed 
#   0.65    0.01    0.67 
system.time(setkey(DT,s,exp,init))
#   user  system elapsed 
#  41.56    0.03   41.59 
system.time(ans2<-DT[J(this_s,this_exp,this_init),4,with=FALSE])
#   user  system elapsed 
#    0       0       0 
identical(ans1,ans2)
# [1] TRUE

Am I doing something wrong? I've read through the data.table FAQs etc. Any help would be greatly appreciated.

Many thanks.

like image 498
mww Avatar asked Dec 21 '13 14:12

mww


1 Answers

The line :

nums <- round(runif(n,0,10000))

leaves nums as type numeric not integer. That makes a big difference. The data.table FAQs and introduction are geared towards integer and character columns; you won't see setkey as slow on those types. For example :

nums <- as.integer(round(runif(n,0,10000)))
...
setkey(DT,s,exp,init)  # much faster now

Two further points though ...

First, the ordering/sorting operations are much faster in the current development version of data.table v1.8.11. @jihoward is right on about sorting on numeric columns being much more time-consuming operation. But, still it's about 5-8x faster in 1.8.11 (because of a 6-pass radix order implementation, check this post). Comparing the time taken for the setkey operation between 1.8.10 and 1.8.11:

# v 1.8.11
system.time(setkey(DT,s,exp,init))
#    user  system elapsed 
#   8.358   0.375   8.844 

# v 1.8.10
system.time(setkey(DT,s,exp,init))
#   user  system elapsed 
# 66.609   0.489  75.216 

It's a 8.5x improvement on my system. So, my guess is this'd take about 4.9 seconds on yours.

Second, as @Roland mentions, if your objective is to perform a couple of subsetting and that is ALL you're going to do, then of course it doesn't make sense to do a setkey as, it has to find the order of columns and then reorder the entire data.table (by reference so that the memory footprint is very minimal, check this post for more on setkey).

like image 196
Arun Avatar answered Nov 18 '22 10:11

Arun