
Performance issue with SD[] when indexing by variable


I am working with data.tables in R. The data has multiple records per id, and I am trying to extract the nth record for each individual using .SD. If I specify the index as an integer literal, the new data.table is created almost instantaneously. But if the index is held in a variable N (as it would be inside a function), the code takes about 700 times longer. With large data sets this is a problem. Is this a known issue, and is there any way to speed it up?

library(data.table)
library(microbenchmark)

set.seed(102938)

dd <- data.table(id = rep(1:10000, each = 10), seq = seq(1:10))  # 10,000 ids, 10 records each
setkey(dd, id)

N <- 2
microbenchmark(dd[,.SD[2], keyby = id],
               dd[,.SD[N], keyby = id],
               times = 5)
#> Unit: microseconds
#>                      expr        min         lq       mean     median         uq        max neval
#>  dd[, .SD[2], keyby = id]    886.269   1584.513   2904.497   1851.356   1997.134   8203.214     5
#>  dd[, .SD[N], keyby = id] 770822.875 810131.784 870418.622 903956.708 912223.026 954958.718     5
asked Jun 01 '19 by kgoldfeld




1 Answer

It may be better to do the subsetting with the row index (.I) instead of .SD:

dd[dd[, .I[N], keyby = id]$V1]
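
As a quick illustration (not part of the original answer, just using the example data from the question), the inner call returns one row per id, with the position of that id's Nth record within dd stored in the default column V1; those positions then subset the whole table:

head(dd[, .I[N], keyby = id])
# one row per id; V1 holds the row number of the Nth record of that id in dd,
# i.e. 2, 12, 22, ... here because each id occupies 10 consecutive rows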

Benchmarks:

microbenchmark(dd[,.SD[2], keyby = id],
                dd[dd[,.I[N], keyby = id]$V1],
                times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 1.253097 1.343862 2.796684 1.352426 1.400910 8.633126     5
# dd[dd[, .I[N], keyby = id]$V1] 5.082752 5.383201 5.991076 5.866084 6.488898 7.134443     5

With .I, the timing improves dramatically compared to .SD, but there is still a performance hit, which would be the time spent looking up the variable 'N' in the global environment.
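
If this ends up inside a function anyway, one way to package the workaround is a small helper that takes the index as an argument. This is only a sketch; nth_by_id and its arguments are illustrative names, not data.table API:

nth_by_id <- function(dt, n, by = "id") {
  idx <- dt[, .I[n], keyby = by]$V1  # row numbers of the nth record per group
  dt[idx]                            # subset the original table by those rows
}

nth_by_id(dd, N)  # same rows as dd[, .SD[N], keyby = id], without the .SD cost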


Internally, data.table's query optimizations play a role in the timings. If we turn all optimizations off by setting the option to 0:

options(datatable.optimize = 0L)
microbenchmark(dd[,.SD[2], keyby = id],
             dd[dd[,.I[N], keyby = id]$V1],
             times = 5)
#Unit: milliseconds
#                          expr        min         lq      mean     median         uq        max neval
#      dd[, .SD[2], keyby = id] 660.612463 701.573252 761.51163 776.780341 785.940196 882.651875     5
#dd[dd[, .I[N], keyby = id]$V1]   3.860492   4.140469   5.05796   4.762518   5.342907   7.183416     5

Now, the .I method is faster

Changing the optimization level to 1:

options(datatable.optimize = 1L)
microbenchmark(dd[,.SD[2], keyby = id],
                 dd[dd[,.I[N], keyby = id]$V1],
                 times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 4.934761 5.109478 5.496449 5.414477 5.868185 6.155342     5
# dd[dd[, .I[N], keyby = id]$V1] 3.923388 3.966413 4.325268 4.379745 4.494367 4.862426     5

With level 2 (GForce optimization, the default):

options(datatable.optimize = 2L)
microbenchmark(dd[,.SD[2], keyby = id],
                 dd[dd[,.I[N], keyby = id]$V1],
                 times = 5)
#Unit: milliseconds
#                           expr      min       lq     mean   median       uq      max neval
#       dd[, .SD[2], keyby = id] 1.113463 1.179071 1.245787 1.205013 1.337216 1.394174     5
# dd[dd[, .I[N], keyby = id]$V1] 4.339619 4.523917 4.774221 4.833648 5.017755 5.156166     5

Under the hood, the optimizations applied can be checked with verbose = TRUE:

out1 <- dd[,.SD[2], keyby = id, verbose = TRUE]
#Finding groups using forderv ... 0.017s elapsed (0.020s cpu) 
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.131s cpu) 
#lapply optimization changed j from '.SD[2]' to 'list(seq[2])'
#GForce optimized j to 'list(`g[`(seq, 2))'
#Making each group and running j (GForce TRUE) ... 0.027s elapsed (0.159s cpu) 

out2 <- dd[dd[,.I[N], keyby = id, verbose = TRUE]$V1, verbose = TRUE]
#Detected that j uses these columns: <none> 
#Finding groups using forderv ... 0.023s elapsed (0.026s cpu) 
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.022s elapsed (0.128s cpu) 
#lapply optimization is on, j unchanged as '.I[N]'
#GForce is on, left j unchanged
#Old mean optimization is on, left j unchanged.
#Making each group and running j (GForce FALSE) ... 
#  memcpy contiguous groups took 0.052s for 10000 groups
#  eval(j) took 0.065s for 10000 calls   #######
#0.068s elapsed (0.388s cpu) 
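
One housekeeping note: the benchmarks above changed the global optimization level, so it is worth restoring the default (Inf, i.e. all optimizations on) afterwards:

options(datatable.optimize = Inf)  # back to the default: all optimizations enabled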
answered Sep 29 '22 by akrun