This question was prompted by this problem.
Consider two vectors, a and b, and two data tables dt.a and dt.b as follows:
a <- c(55, 1:25)
b <- c(55, 30:40)
library(data.table)
dt.a <- data.table(x = a, key = "x")
dt.b <- data.table(x = b, key = "x")
intersect(a, b)
[1] 55
dt.a[dt.b, nomatch = 0]
    x
1: 55
The objective is to count the number of common elements.
My question is: why is the data.table join roughly 30x slower than intersect(...)?
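Since the objective is the count rather than the elements themselves, both approaches wrap directly; a minimal sketch using the objects above (length() and nrow() are the standard ways to count here):

length(intersect(a, b))          # base R: number of common elements
nrow(dt.a[dt.b, nomatch = 0])    # data.table: row count of the inner join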
system.time(for (i in 1:1000) intersect(a, b))
   user  system elapsed 
   0.05    0.00    0.04 
system.time(for (i in 1:1000) dt.a[dt.b, nomatch = 0])
   user  system elapsed 
   1.68    0.00    1.69 
The power of data.table shines when it is given a "big" problem. On a tiny problem like this one, the fixed overhead of [.data.table dwarfs the time actually spent on the binary search component; a back-of-envelope check after the benchmark below makes this concrete.
If you give it a "big" problem, data.table scales and you will see the difference.
# a "bigger" problem
a <- c(55, 1:25e6)
b <- c(55,30:40e6)
library(data.table)
dt.a <- data.table(x=a,key="x")
dt.b <- data.table(x=b,key="x")
library(microbenchmark)
microbenchmark(intersect(a,b), dt.a[dt.b, nomatch=0],times=5)
## Unit: seconds
##                     expr      min       lq   median       uq      max neval
##          intersect(a, b) 6.848245 6.897009 6.962055 7.052095 7.058509     5
##  dt.a[dt.b, nomatch = 0] 3.629062 3.654269 3.685051 3.721983 3.815155     5
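As a back-of-envelope check of the overhead claim (a sketch from the timings above; numbers are machine-dependent):

per_call_small <- 1.69 / 1000     # ~1.7 ms per tiny join, almost entirely [.data.table overhead
big_join_median <- 3.685051       # median seconds for the 25-million-row keyed join
per_call_small / big_join_median  # fixed overhead is well under 0.1% of the big-join time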