I am working with data.tables in R. The data has multiple records per id, and I am trying to extract the nth record for each individual using the .SD approach. If I specify the index as a literal integer, the new data.table is created almost instantaneously, but if the index is a variable N (as it might be inside a function), the code takes about 700 times longer. With large data sets this is a problem. Is this a known issue, and is there any way to speed it up?
library(data.table)
library(microbenchmark)
set.seed(102938)
dd <- data.table(id = rep(1:10000, each = 10), seq = seq(1:10))
setkey(dd, id)
N <- 2
microbenchmark(dd[,.SD[2], keyby = id],
dd[,.SD[N], keyby = id],
times = 5)
#> Unit: microseconds
#>                     expr        min         lq       mean     median         uq        max neval
#> dd[, .SD[2], keyby = id]    886.269   1584.513   2904.497   1851.356   1997.134   8203.214     5
#> dd[, .SD[N], keyby = id] 770822.875 810131.784 870418.622 903956.708 912223.026 954958.718     5
It may be better to do the subsetting with the row index (.I) instead of .SD:
dd[dd[, .I[N], keyby = id]$V1]
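Since the question mentions calling this inside a function, here is a minimal sketch of how the .I approach could be wrapped; the function name nth_by_id is my own, and it assumes the grouping column is called id as in the example:
nth_by_id <- function(dt, n) {
  # .I[n] gives the global row number of the nth record within each id group
  # (NA when a group has fewer than n rows); the outer subset then pulls those rows
  dt[dt[, .I[n], keyby = id]$V1]
}
nth_by_id(dd, N)  # same rows as dd[, .SD[2], keyby = id] when N == 2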
Benchmarks:
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 1.253097 1.343862 2.796684 1.352426 1.400910 8.633126     5
# dd[dd[, .I[N], keyby = id]$V1] 5.082752 5.383201 5.991076 5.866084 6.488898 7.134443     5
With .I, the timing improves considerably compared to .SD[N], but there is still a performance hit relative to the literal .SD[2], which would be the time spent searching the global environment for the variable 'N'.
Internally, data.table's query optimizations play a role in these timings. If we switch all optimizations off by setting the option to 0:
options(datatable.optimize = 0L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr        min         lq      mean     median         uq        max neval
#       dd[, .SD[2], keyby = id] 660.612463 701.573252 761.51163 776.780341 785.940196 882.651875     5
# dd[dd[, .I[N], keyby = id]$V1]   3.860492   4.140469   5.05796   4.762518   5.342907   7.183416     5
Now, the .I method is faster.
Changing the option to 1:
options(datatable.optimize = 1L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 4.934761 5.109478 5.496449 5.414477 5.868185 6.155342     5
# dd[dd[, .I[N], keyby = id]$V1] 3.923388 3.966413 4.325268 4.379745 4.494367 4.862426     5
With 2 - GForce optimization, which is on by default:
options(datatable.optimize = 2L)
microbenchmark(dd[,.SD[2], keyby = id],
dd[dd[,.I[N], keyby = id]$V1],
times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 1.113463 1.179071 1.245787 1.205013 1.337216 1.394174     5
# dd[dd[, .I[N], keyby = id]$V1] 4.339619 4.523917 4.774221 4.833648 5.017755 5.156166     5
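If you experiment with these options, it is worth restoring the default afterwards; per ?datatable.optimize the default level is Inf, i.e. all optimizations on, including GForce:
options(datatable.optimize = Inf)  # back to the default (all optimizations enabled)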
The optimizations applied under the hood can be checked with verbose = TRUE:
out1 <- dd[,.SD[2], keyby = id, verbose = TRUE]
#Finding groups using forderv ... 0.017s elapsed (0.020s cpu)
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.131s cpu)
#lapply optimization changed j from '.SD[2]' to 'list(seq[2])'
#GForce optimized j to 'list(`g[`(seq, 2))'
#Making each group and running j (GForce TRUE) ... 0.027s elapsed (0.159s cpu)
out2 <- dd[dd[,.I[N], keyby = id, verbose = TRUE]$V1, verbose = TRUE]
#Detected that j uses these columns: <none>
#Finding groups using forderv ... 0.023s elapsed (0.026s cpu)
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.128s cpu)
#lapply optimization is on, j unchanged as '.I[N]'
#GForce is on, left j unchanged
#Old mean optimization is on, left j unchanged.
#Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.052s for 10000 groups
# eval(j) took 0.065s for 10000 calls #######
#0.068s elapsed (0.388s cpu)
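As the verbose output above suggests, GForce is applied only when j contains a literal index (.SD[2]); with a variable, j is left unoptimized. A hedged workaround sketch, using base R's bquote()/eval() to splice the value of N into the call as a literal before data.table evaluates it, so the GForce path shown for out1 should apply (check with verbose = TRUE in your own session):
qcall <- bquote(dd[, .SD[.(N)], keyby = id])  # .(N) inserts the value of N, giving .SD[2]
out3 <- eval(qcall)
identical(out3, dd[, .SD[2], keyby = id])  # TRUE when N == 2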