
data.table: performance of binary search VS vector scan

I was looking at the best way to subset on a data.table defined as follows:

library(data.table)
library(microbenchmark)

set.seed(2L)
N = 1e7L
DT = data.table(x = sample(letters, N, TRUE),
                y = sample(1000L, N, TRUE),
                val = runif(N))
setkey(DT, x, y)
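
As a quick sanity check, key() and haskey() from data.table confirm the table is now physically sorted by these columns:

key(DT)     # "x" "y"
haskey(DT)  # TRUE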

There is the binary search way (SUBSET1) and the 'vector scan' way (SUBSET2).

SUBSET1 <- function(){
  a <- DT[.(c("a"), c(5L)), .N, nomatch = NULL]
}
SUBSET2 <- function(){
  a <- DT[ x == "a" & y == 5L, .N, nomatch = NULL]
}

What I quite like about the 'vector scan' way is that it is self-explanatory and very readable. Nevertheless, it seems to be about twice as slow as the native binary search way.

microbenchmark(SUBSET1(), 
               SUBSET2(), 
               times = 500 )
  Unit: milliseconds
        expr    min      lq     mean  median     uq      max neval
   SUBSET1() 1.0328 1.27790 1.878415 1.53370 1.8924  20.5789   500
   SUBSET2() 2.4896 3.06665 4.476864 3.52685 4.3682 179.1607   500

My question
I don't understand why SUBSET2 is slower. Is it because the 'vector scan' form is internally converted to a binary search, or is the vector scan actually executed as such (and is therefore slower than binary search)?

asked May 03 '20 by Cédric Guilmin

1 Answer

As pointed out by @jangorecki, both queries are already using the key -- the latter one just takes a small amount of extra time to map the "vector scan" form into the binary search form. You can see this with verbose=TRUE:

DT[ x == "a" & y == 5L, .N, nomatch = NULL, verbose = TRUE]

shows output:

Optimized subsetting with key 'x, y'
forder.c received 1 rows and 2 columns
forder took 0.001 sec
x is already ordered by these columns, no need to call reorder
i.x has same type (character) as x.x. No coercion needed.
i.y has same type (integer) as x.y. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Detected that j uses these columns: <none> 

Compare with the direct binary search version:

DT[.("a", 5L), .N, nomatch = NULL, verbose = TRUE]
i.V1 has same type (character) as x.x. No coercion needed.
i.V2 has same type (integer) as x.y. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
forder.c received 1 rows and 2 columns
bmerge done in 0.001s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Detected that j uses these columns: <none> 

But the == form is still about twice as slow, right? Keep in mind that the time scale here is very small in absolute terms. A more useful comparison is against the case where no key is used at all. Let's make an unsorted copy of your data:

DTrand = DT[sample(.N)]
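
Note that subsetting the rows in random order drops the key, so the copy starts out with neither a key nor a secondary index (key() and indices() should both return NULL here):

key(DTrand)      # NULL
indices(DTrand)  # NULL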

Another quick aside: we have to be careful when benchmarking, because data.table also applies automatic optimizations (auto-indexing) that help it sort your data even in this unsorted case:

DTrand[ x == "a" & y == 5L, .N, nomatch = NULL, verbose = TRUE]

Read the output carefully:

Creating new index 'y__x'
Creating index y__x done in ... forder.c received 10000000 rows and 3 columns
forder took 0.424 sec
0.286s elapsed (1.117s cpu) 
Optimized subsetting with index 'y__x'
forder.c received 1 rows and 2 columns
forder took 0.002 sec
x is already ordered by these columns, no need to call reorder
i.y has same type (integer) as x.y. No coercion needed.
i.x has same type (character) as x.x. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.001s cpu) 
Reorder irows for 'mult=="all" && !allGrp1' ... forder.c received 360 rows and 2 columns
0.000s elapsed (0.002s cpu) 
Detected that j uses these columns: <none> 
[1] 360

data.table has automatically applied setindex to your table. While not as fast as the physical sort done by setkey, the index will nevertheless speed up any future subsets. Simply repeat the same query (as would happen during a benchmark):

DTrand[ x == "a" & y == 5L, .N, nomatch = NULL, verbose = TRUE]

Notice how similar the output now is to the keyed case (with 'index' in place of 'key'):

Optimized subsetting with index 'y__x'
forder.c received 1 rows and 2 columns
forder took 0 sec
x is already ordered by these columns, no need to call reorder
i.y has same type (integer) as x.y. No coercion needed.
i.x has same type (character) as x.x. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Reorder irows for 'mult=="all" && !allGrp1' ... forder.c received 360 rows and 2 columns
0.001s elapsed (0.001s cpu) 
Detected that j uses these columns: <none> 
[1] 360

Thus, a naive benchmark even on DTrand wouldn't be a true comparison: after the first run, the table is indexed, and subsequent subsets will use that index and binary search. See the vignette on secondary indices and auto indexing for more details.
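
You can confirm this with indices(), which lists a table's secondary indices; here it should show the y__x index that the verbose output above reported creating:

indices(DTrand)
# [1] "y__x"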

We can sidestep this and get a proper benchmark by setting the option datatable.auto.index to FALSE and resetting the existing index:

options(datatable.auto.index = FALSE)
setindex(DTrand, NULL)

Now data.table no longer knows how to sort DTrand by x and y, and we can compare binary search (on the keyed DT) against a true vector scan (on DTrand):

microbenchmark::microbenchmark(
  times = 50L,
  vector = DTrand[ x == "a" & y == 5L, .N, nomatch = NULL],
  binary = DT[     x == "a" & y == 5L, .N, nomatch = NULL]
)
# Unit: milliseconds
#    expr       min         lq       mean     median        uq        max neval
#  vector 101.43306 114.325340 134.154362 119.367909 128.05273 345.721296    50
#  binary   1.06033   1.160188   1.631119   1.367017   1.57334   5.508802    50

So, while the direct binary search using .() is about twice as fast as the optimized approach using ==, the == form is still roughly 100x faster than a true vector scan.
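
One housekeeping note: if you turned auto-indexing off only for the benchmark, you may want to switch it back on afterwards, since TRUE is the default:

options(datatable.auto.index = TRUE)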

You might also benefit from the data.table benchmarking vignette.
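
If data.table is installed, that vignette (and the one on secondary indices) can be opened directly from R; the names below assume the vignette titles shipped with current data.table releases:

vignette("datatable-benchmarking", package = "data.table")
vignette("datatable-secondary-indices-and-auto-indexing", package = "data.table")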

answered by MichaelChirico