
Why does data.table notation for column selection affect speed

Tags: r, data.table

There seems to be a difference in speed depending on how you specify the columns to be selected from a data.table: x[, .(var)] vs x[, c('var')]. The reason may be completely obvious, but the help page seems to use the .(), list() and c() notations interchangeably. I work with quite large datasets, so this matters to me :-)

Example (the order of call does not affect the speed):

library(data.table)
library(tictoc)  # provides tic() and toc()

x <- as.data.table(as.character(rnorm(20000000,1,0.5)))
setkey(x, V1)

tic(); x[, .(V1)]; toc()
25.08 sec elapsed


tic(); x[, c('V1')]; toc()
0.28 sec elapsed

tic(); x[, 1]; toc()
0.02 sec elapsed

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tictoc_1.0        data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.1  tools_3.6.1     lifecycle_0.2.0 rlang_0.4.6  
Asked May 26 '20 09:05 by Jesper


1 Answer

You have found a bug (issue filed here): data.table tries to determine whether the output of [] is keyed, and to do so it runs an internal is.sorted function. This is very slow on a huge table of unique strings.

Fortunately, we can do static analysis and realize that your output table is in fact keyed -- there's no subset, and the key column (V1) is unchanged. Therefore the sort order cannot have changed, and your output will also be sorted by V1.
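The reasoning above can be checked on a small table. The sketch below uses a made-up 1,000-row table (the size and seed are assumptions for illustration, not the original benchmark) and inspects the key before and after selecting the key column with no subset in i. What key() reports for the result may depend on the installed data.table version, so no output is shown.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the illustration is reproducible
x <- data.table(V1 = as.character(rnorm(1000, 1, 0.5)))
setkey(x, V1)           # sorts x by V1 and marks V1 as the key

y <- x[, .(V1)]         # no subset in i, key column selected unchanged

key(x)                  # key of the input
key(y)                  # key (if any) carried over to the output
identical(y$V1, x$V1)   # row order is unchanged by the selection
```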

This logic is built into a PR that fixes the issue. You can test it with remotes::install_github('Rdatatable/data.table@fix_sorting_on_sorted'), with the caveat that this is a bleeding-edge version of the package; otherwise you can wait until it is merged to master, or until a new version is released to CRAN.

In the meantime, here's a workaround:

setkey(x, NULL)
system.time(x[ , .(V1)])
#    user  system elapsed 
#   0.120   0.087   0.213

Of course, this prevents later processing from recognizing that your data is sorted, and from using the efficiencies that come with that.

In this case (and this case only; use with care!), where you are certain that the data is already sorted by V1, you can restore the key instantly with:

setattr(x, 'sorted', 'V1')
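Put together, the workaround might look like the following sketch. It runs on a small illustrative table (size and seed are assumptions), saves the key name in a variable called old_key (a name introduced here, not part of data.table), and restores it only because nothing reordered or modified x in between.

```r
library(data.table)

set.seed(42)  # arbitrary seed for the illustration
x <- data.table(V1 = as.character(rnorm(1000, 1, 0.5)))
setkey(x, V1)

old_key <- key(x)               # remember the key before dropping it
setkey(x, NULL)                 # drop the key; the rows stay in sorted order
res <- x[, .(V1)]               # x has no key now, so there is nothing to re-check
setattr(x, "sorted", old_key)   # safe only because x was not reordered or modified

key(x)                          # the key is back on x
```

Note that res itself carries no key here, since it was created while x was unkeyed.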

More generally, there are small differences among selection with [, [[, $, etc. [ will tend to be the slowest, since we do a lot of "static query analysis" to help improve the efficiency of your code. That analysis comes with a performance cost, which we hope will be small almost every time; any time this cost is not small, it should be considered a bug. There is also active work on offering shortcuts to reduce this overhead; see for example this PR.
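As a rough illustration of the different selection forms, the sketch below extracts the same column three ways from a small made-up table (size and seed are assumptions) and times each with system.time(). Note that [ with .() returns a data.table while [[ and $ return the bare vector, so the three are not drop-in replacements for each other. Timings are deliberately not shown, since they depend on the machine, the data, and the data.table version.

```r
library(data.table)

set.seed(42)  # arbitrary seed for the illustration
x <- data.table(V1 = as.character(rnorm(1e5, 1, 0.5)))

system.time(a <- x[, .(V1)])    # goes through [.data.table; returns a data.table
system.time(b <- x[["V1"]])     # returns the column as a character vector
system.time(d <- x$V1)          # same column via $; also a character vector

class(a)
class(b)
identical(b, d)
```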

Answered Oct 24 '22 19:10 by MichaelChirico