Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Will using %chin% to subset character columns of an auto-indexed data.table ever improve speed?

Tags:

r

data.table

TLDR: When using more recent versions of data.table that use auto-indexing, is there any benefit from using %chin% to subset a data.table on character columns?

In the past, using %chin% from data.table in place of %in% when subsetting on character vectors resulted in a significant speed-up. In newer versions of data.table, secondary indices are automatically created on non-key columns when sub-setting. The creation and usage of these indices appears to make any speed difference between %chin% and %in% irrelevant.

Going forward, are there any cases where using %chin% to subset a data.table will still improve speed, or can I just use %in% in the future?


Update: The conversation on PR#2494: Better subsetting optimization for compound queries seems to support an understanding that when evaluated in the data.table calling environment, the execution methods of %chin% have been fundamentally changed.

For cases where the column used to subset the table more than once, performance will be drastically increased by auto-indexing, but when only used a single time (and thus not benefiting from the time spent generating the index), turning auto-indexing off will sometimes give marginally faster results.

I'll leave this open for a couple days, but I may flesh this out into an answer for the sake of posterity.


Data Generation and Benchmarking

The data generated is made up of a randomly ordered combination of two unbalanced samples:

  • 10 million single character strings, 26 possible unique values
  • 1 million four character strings, 456,976 possible unique values

The intent here is to be representative of non-normal variables that are dominated by a few common values, but have numerous less common possibilities.

library(data.table)
library(microbenchmark)
set.seed(1234)

## Create a vector of 1 million 4 character strings
## with 456,976 possible unique values 
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE),
                  sample(letters,DiverseSize,replace = TRUE))

## Create a vector of 10 million single character strings
## with 26 possible unique values
CommonSize  <- 1e7
Common <-  sample(LETTERS,CommonSize,replace = TRUE)

## Mix them into a data.table column, "x"
DT1 <- data.table(x = sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE))
## Make a deep copy to run independent comparisons
DT2 <- copy(DT1)

Comparing %in% and %chin%

When executing outside of the data.table environment, we still get a significant speed-up by using %chin%.

microbenchmark(
  Outside_chin = length(which(DT1[["x"]] %chin% c("Matt"))),
  Outside_in   = length(which(DT2[["x"]] %in% c("Matt"))),
  times = 1
)

...

Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 Outside_chin 254.5967 254.5967 254.5967 254.5967 254.5967 254.5967     1
   Outside_in 476.2117 476.2117 476.2117 476.2117 476.2117 476.2117     1

Comparing %in% and %chin%with auto-indexing

## Benchmarking -------
## Turn off Indices
options(datatable.auto.index = FALSE)
options(datatable.use.index = FALSE)

## Run without indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

## Run Again
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

options(datatable.auto.index = TRUE)
options(datatable.use.index = TRUE)

## First run builds indices and takes longer
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

## Run again, benefiting from pre-built indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]

Profiling Results

When using ProfVis to analyze the run-time of each expression, the following is apparent:

  1. Without auto-indexing, %chin% is faster and run-time is similar the first and second time.
  2. With auto-indexing, run-time is basically the same whether %chin% or %in% is used.
  3. While the auto-indexing makes the first execution slightly slower, the second call evaluates so fast profvis can't even catch it consistently with a 10 ms sample interval

%chin% %in% comparison


I am currently running data.table version 1.10.5, built 2018-03-17 07:30:06 UTC.

like image 838
Matt Summersgill Avatar asked Mar 20 '18 13:03

Matt Summersgill


1 Answers

Providing an answer here for the sake of closure: I've done my best to skim through the relevant commits on Github, but there's certainly a possibility I'm missing some of the finer nuances. If any of the contributors would like to make some clarifications, I'd be happy to update here.

The conversation on PR#2494: Better subsetting optimization for compound queries seems to support an understanding that when evaluated in the data.table calling environment, the execution methods of %chin% have been fundamentally changed.

  1. When sub-setting a data.table in cases where the same column will be used to subset the table multiple times, performance can be drastically increased by auto-indexing, so stick with the default options: options(datatable.auto.index = TRUE) and options(datatable.use.index = TRUE). In this case, using %chin% will not result in any performance improvement.

  2. If performance is critical and you know that you will only be sub-setting on a given column a single time (and thus not benefiting from the time spent generating the index), turning auto-indexing off with options(datatable.auto.index = FALSE)and can give marginally faster results. You can still manually create indexes and keys as always if you will be performing repetitive subsets, but the burden of optimization will rest on the user to make appropriate use of setkey() and setindex(). When auto-indexing is turned off, using %chin% will be faster than %in%.

  3. When testing whether an element is present in a character vector outside of the data.table calling environment , %chin% is still faster than %in%

like image 73
Matt Summersgill Avatar answered Nov 15 '22 16:11

Matt Summersgill