TLDR: When using more recent versions of
data.table
that use auto-indexing, is there any benefit from using%chin%
to subset a data.table on character columns?
In the past, using %chin%
from data.table
in place of %in%
when subsetting on character vectors resulted in a significant speed-up. In newer versions of data.table
, secondary indices are automatically created on non-key columns when sub-setting. The creation and usage of these indices appears to make any speed difference between %chin%
and %in%
irrelevant.
Going forward, are there any cases where using %chin%
to subset a data.table will still improve speed, or can I just use %in%
in the future?
Update: The conversation on PR#2494: Better subsetting optimization for compound queries seems to support an understanding that when evaluated in the data.table calling environment, the execution methods of %chin%
have been fundamentally changed.
For cases where the column used to subset the table more than once, performance will be drastically increased by auto-indexing, but when only used a single time (and thus not benefiting from the time spent generating the index), turning auto-indexing off will sometimes give marginally faster results.
I'll leave this open for a couple days, but I may flesh this out into an answer for the sake of posterity.
The data generated is made up of a randomly ordered combination of two unbalanced samples:
The intent here is to be representative of non-normal variables that are dominated by a few common values, but have numerous less common possibilities.
library(data.table)
library(microbenchmark)
set.seed(1234)
## Create a vector of 1 million 4 character strings
## with 456,976 possible unique values
DiverseSize <- 1e6
Diverse <- paste0(sample(LETTERS,DiverseSize,replace = TRUE),
sample(letters,DiverseSize,replace = TRUE),
sample(letters,DiverseSize,replace = TRUE),
sample(letters,DiverseSize,replace = TRUE))
## Create a vector of 10 million single character strings
## with 26 possible unique values
CommonSize <- 1e7
Common <- sample(LETTERS,CommonSize,replace = TRUE)
## Mix them into a data.table column, "x"
DT1 <- data.table(x = sample(c(Diverse,Common),size = CommonSize + DiverseSize, replace = FALSE))
## Make a deep copy to run independent comparisons
DT2 <- copy(DT1)
%in%
and %chin%
When executing outside of the data.table
environment, we still get a significant speed-up by using %chin%
.
microbenchmark(
Outside_chin = length(which(DT1[["x"]] %chin% c("Matt"))),
Outside_in = length(which(DT2[["x"]] %in% c("Matt"))),
times = 1
)
...
Unit: milliseconds
expr min lq mean median uq max neval
Outside_chin 254.5967 254.5967 254.5967 254.5967 254.5967 254.5967 1
Outside_in 476.2117 476.2117 476.2117 476.2117 476.2117 476.2117 1
%in%
and %chin%
with auto-indexing## Benchmarking -------
## Turn off Indices
options(datatable.auto.index = FALSE)
options(datatable.use.index = FALSE)
## Run without indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
## Run Again
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
options(datatable.auto.index = TRUE)
options(datatable.use.index = TRUE)
## First run builds indices and takes longer
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
## Run again, benefiting from pre-built indices
DT2[x %chin% c("Matt"), .N]
DT1[x %in% c("Matt"), .N]
When using ProfVis to analyze the run-time of each expression, the following is apparent:
%chin%
is faster and run-time is similar the first and second time.%chin%
or %in%
is used.I am currently running data.table version 1.10.5, built 2018-03-17 07:30:06 UTC.
Providing an answer here for the sake of closure: I've done my best to skim through the relevant commits on Github, but there's certainly a possibility I'm missing some of the finer nuances. If any of the contributors would like to make some clarifications, I'd be happy to update here.
The conversation on PR#2494: Better subsetting optimization for compound queries seems to support an understanding that when evaluated in the data.table calling environment, the execution methods of %chin%
have been fundamentally changed.
When sub-setting a data.table
in cases where the same column will be used to subset the table multiple times, performance can be drastically increased by auto-indexing, so stick with the default options: options(datatable.auto.index = TRUE)
and options(datatable.use.index = TRUE)
. In this case, using %chin%
will not result in any performance improvement.
If performance is critical and you know that you will only be sub-setting on a given column a single time (and thus not benefiting from the time spent generating the index), turning auto-indexing off with options(datatable.auto.index = FALSE)
and can give marginally faster results. You can still manually create indexes and keys as always if you will be performing repetitive subsets, but the burden of optimization will rest on the user to make appropriate use of setkey()
and setindex()
. When auto-indexing is turned off, using %chin%
will be faster than %in%
.
When testing whether an element is present in a character vector outside of the data.table
calling environment , %chin%
is still faster than %in%
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With