I am working with data.tables in R. The data has multiple records per id, and I am trying to extract the nth record for each individual using the .SD approach. If I specify the index as a literal integer, the new data.table is created almost instantaneously, but if the index is a variable N (as it might be inside a function), the code takes about 700 times longer. With large data sets this is a problem. Is this a known issue, and is there any way to speed it up?
library(data.table)
library(microbenchmark)
set.seed(102938)
dd <- data.table(id = rep(1:10000, each = 10), seq = seq(1:10))
setkey(dd, id)
N <- 2
microbenchmark(dd[,.SD[2], keyby = id],
dd[,.SD[N], keyby = id],
times = 5)
#> Unit: microseconds
#>                     expr        min         lq       mean     median         uq        max neval
#> dd[, .SD[2], keyby = id]    886.269   1584.513   2904.497   1851.356   1997.134   8203.214     5
#> dd[, .SD[N], keyby = id] 770822.875 810131.784 870418.622 903956.708 912223.026 954958.718     5
It may be better to do the subsetting with the row index (.I) instead of .SD:
dd[dd[, .I[N], keyby = id]$V1]
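Since the question mentions calling this inside a function, here is a minimal sketch of how the .I approach could be wrapped; the function name nth_by_id is my own, and it assumes the grouping column is called id as in the example:
nth_by_id <- function(dt, n) {
  # .I[n] gives the global row number of the nth record within each id group
  # (NA when a group has fewer than n rows); the outer subset then pulls those rows
  dt[dt[, .I[n], keyby = id]$V1]
}
nth_by_id(dd, N)  # same rows as dd[, .SD[2], keyby = id] when N == 2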
Benchmarks:
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 1.253097 1.343862 2.796684 1.352426 1.400910 8.633126     5
# dd[dd[, .I[N], keyby = id]$V1] 5.082752 5.383201 5.991076 5.866084 6.488898 7.134443     5
With .I, the timing improves considerably compared to .SD[N], but there is still a performance hit relative to the literal .SD[2], which would be the time spent searching the global environment for the variable 'N'.
Internally, data.table's query optimizations play a role in these timings. If we switch all optimizations off by setting the option to 0:
options(datatable.optimize = 0L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr        min         lq      mean     median         uq        max neval
#       dd[, .SD[2], keyby = id] 660.612463 701.573252 761.51163 776.780341 785.940196 882.651875     5
# dd[dd[, .I[N], keyby = id]$V1]   3.860492   4.140469   5.05796   4.762518   5.342907   7.183416     5
Now, the .I method is faster.
Changing the option to 1:
options(datatable.optimize = 1L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 4.934761 5.109478 5.496449 5.414477 5.868185 6.155342     5
# dd[dd[, .I[N], keyby = id]$V1] 3.923388 3.966413 4.325268 4.379745 4.494367 4.862426     5
With 2 - GForce optimization, which is on by default:
options(datatable.optimize = 2L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 1.113463 1.179071 1.245787 1.205013 1.337216 1.394174     5
# dd[dd[, .I[N], keyby = id]$V1] 4.339619 4.523917 4.774221 4.833648 5.017755 5.156166     5
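If you experiment with these options, it is worth restoring the default afterwards; per ?datatable.optimize the default level is Inf, i.e. all optimizations on, including GForce:
options(datatable.optimize = Inf)  # back to the default (all optimizations enabled)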
The optimizations applied under the hood can be checked with verbose = TRUE:
out1 <- dd[,.SD[2], keyby = id, verbose = TRUE]
#Finding groups using forderv ... 0.017s elapsed (0.020s cpu)
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.131s cpu)
#lapply optimization changed j from '.SD[2]' to 'list(seq[2])'
#GForce optimized j to 'list(`g[`(seq, 2))'
#Making each group and running j (GForce TRUE) ... 0.027s elapsed (0.159s cpu)
out2 <- dd[dd[,.I[N], keyby = id, verbose = TRUE]$V1, verbose = TRUE]
#Detected that j uses these columns: <none>
#Finding groups using forderv ... 0.023s elapsed (0.026s cpu)
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.128s cpu)
#lapply optimization is on, j unchanged as '.I[N]'
#GForce is on, left j unchanged
#Old mean optimization is on, left j unchanged.
#Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.052s for 10000 groups
# eval(j) took 0.065s for 10000 calls #######
#0.068s elapsed (0.388s cpu)
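As the verbose output above suggests, GForce is applied only when j contains a literal index (.SD[2]); with a variable, j is left unoptimized. A hedged workaround sketch, using base R's bquote()/eval() to splice the value of N into the call as a literal before data.table evaluates it, so the GForce path shown for out1 should apply (check with verbose = TRUE in your own session):
qcall <- bquote(dd[, .SD[.(N)], keyby = id])  # .(N) inserts the value of N, giving .SD[2]
out3 <- eval(qcall)
identical(out3, dd[, .SD[2], keyby = id])  # TRUE when N == 2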