There seems to be a difference in speed depending on how you specify the columns to be selected from a data.table: x[, .(var)] vs x[, c('var')].
The reason may be completely obvious; however, in the help page the .(), list() and c() notations seem to be used interchangeably.
I work with quite large datasets, so it is a bit important to me :-)
Example (the order of call does not affect the speed):
x <- as.data.table(as.character(rnorm(20000000,1,0.5)))
setkey(x, V1)
tic(); x[, .(V1)]; toc()
25.08 sec elapsed
tic(); x[, c('V1')]; toc()
0.28 sec elapsed
tic(); x[, 1]; toc()
0.02 sec elapsed
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tictoc_1.0 data.table_1.12.8
loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 lifecycle_0.2.0 rlang_0.4.6
You have found a bug (issue filed here). data.table is trying to determine if the output of [] is keyed; in order to do so, it is running an internal is.sorted function. This is very slow on a huge table of unique strings.
Fortunately, we can do static analysis and realize that your output table is in fact keyed: there's no subset, and the key column (V1) is unchanged. Therefore the sort order cannot have changed, and your output will also be sorted by V1.
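As a small-scale illustration of that reasoning (a sketch on a toy table, not the internal check itself), you can confirm that selecting the key column without a row subset leaves the values untouched. The table name `small` and its size are made up for this example.

```r
library(data.table)

# Toy stand-in for the large table in the question (name and size are arbitrary)
set.seed(1)
small <- data.table(V1 = as.character(rnorm(1000, 1, 0.5)))
setkey(small, V1)

# No row subset, key column selected unchanged
out <- small[, .(V1)]

# The values are identical, so the sort order cannot have changed
identical(out$V1, small$V1)

# Inspect whether the output carries the key over
key(out)
```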
This logic is built into a PR to fix this issue. You can test it out with remotes::install_github('Rdatatable/data.table@fix_sorting_on_sorted'), with the caveat that this is a bleeding-edge version of the package, or you can wait until it's merged to master or a new version is released to CRAN.
In the meantime, here's a workaround:
setkey(x, NULL)
system.time(x[ , .(V1)])
# user system elapsed
# 0.120 0.087 0.213
Of course, this prevents later processing from recognizing that your data is sorted, so you lose the efficiencies that come with a key.
In this case (and this case only; use with care!), where you are yourself certain that the data is already sorted by V1, you can restore the key instantly with:
setattr(x, 'sorted', 'V1')
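If you would rather not rely on certainty alone, here is a hedged sketch of the whole workaround with a guard before the key is restored. It assumes a toy table `small` (a made-up name) and uses base R's sort(method = "radix"), which orders character vectors in the C locale, the same ordering data.table uses for keys. The guard costs a full sort, so on a very large table it gives up part of the speed benefit.

```r
library(data.table)

# Toy stand-in for the large table in the question (name and size are arbitrary)
set.seed(1)
small <- data.table(V1 = as.character(rnorm(1000, 1, 0.5)))
setkey(small, V1)

setkey(small, NULL)      # drop the key (the workaround)
out <- small[, .(V1)]    # selection on an unkeyed table

# Guard: only restore the key if the column really is sorted in C-locale order
if (identical(small$V1, sort(small$V1, method = "radix"))) {
  setattr(small, "sorted", "V1")
}

key(small)
```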
More generally, there are small differences among selection with [, [[, $, etc. [ will tend to be the slowest since we do a lot of "static query analysis" to help improve the efficiency of your code, which comes with a performance cost that we hope will be small almost every time. Any time this cost is not small, it should be considered a bug. There is also active work on offering shortcuts to reduce this overhead; see for example this PR.
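To see the difference between these selection forms for yourself, a minimal sketch along the following lines may help. The table `small` and the repetition count are arbitrary choices for illustration, and timings depend on your machine and data.table version, so none are quoted here.

```r
library(data.table)

# Toy table (name and size are arbitrary)
set.seed(1)
small <- data.table(V1 = as.character(rnorm(1000, 1, 0.5)))

a <- small[, .(V1)]    # `[` with j: returns a one-column data.table
b <- small[["V1"]]     # `[[`: returns the column as a plain vector
d <- small$V1          # `$`: also returns the column as a plain vector

is.data.table(a)
identical(b, d)

# Repeat each call to make the per-call overhead of `[` visible
system.time(for (i in 1:1000) small[, .(V1)])
system.time(for (i in 1:1000) small[["V1"]])
```

Note that [[ and $ return the bare column rather than a data.table, so they are only a substitute when a vector is what you need.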