 

Subsetting a data.table in a for loop is slower and resource hungry

Working with the data.table R package, I noticed really high processor usage when running a simple for loop that subsets a dataset using values from another data.table. When I say high usage, I mean 100% on all available threads for the entire time the loop is running.

The interesting part is that the same process with a data.frame object takes about 10x less time for the same output, and with only one core at 100%.

Here is my hopefully reproducible example:

# toy interval data: chr/start/end columns
chr = c(rep(1, 1000), rep(2, 1000), rep(3, 1000), rep(3, 1000))
start = rep(seq(from = 1, to = 100000, by = 100), 4)
end = start + 100

df1 <- data.frame(chr = chr, start = start, end = end)  # 4000 rows
df2 <- rbind(df1, df1, df1, df1, df1)                   # 20000 rows
dt1 <- data.table::data.table(df1)
dt2 <- data.table::data.table(df2)

test1 <- list()
test2 <- list()

#loop subsetting a data.frame
system.time(
for (i in 1:nrow(df2)) {
  no.dim <- dim(df1[df1$chr == df2[i, 'chr'] & df1$start >= df2[i, 'start'] & df1$end <= df2[i, 'end'], ])[1]
  test1[i] <- no.dim
})

# loop subsetting a data.table using data.table syntax
system.time(
for (i in 1:nrow(dt2)) {
  no.dim <- dim(dt1[chr == dt2[i, chr] & start >= dt2[i, start] & end <= dt2[i, end], ])[1]
  test2[i] <- no.dim
})

# is the output the same
identical(test1, test2)

And this is the output:

> #loop subsetting a data.frame
> system.time(
+ for (i in 1:nrow(df2)) {
+   no.dim <- dim(df1[df1$chr == df2[i, 'chr'] & df1$start >= df2[i, 'start'] & df1$end <= df2[i, 'end'], ])[1]
+   test1[i] <- no.dim
+ })
   user  system elapsed 
  2.607   0.004   2.612 
> 
> # loop subsetting a data.table using data.table syntax
> system.time(
+ for (i in 1:nrow(dt2)) {
+   no.dim <- dim(dt1[chr == dt2[i, chr] & start >= dt2[i, start] & end <= dt2[i, end], ])[1]
+   test2[i] <- no.dim
+ })
   user  system elapsed 
192.632   0.152  24.398 
> 
> # is the output the same
> identical(test1, test2)
[1] TRUE

Now, I know there are probably multiple better and more efficient ways to perform the same task, and that I am probably not doing it the data.table way. But let's say for some reason you had a script using data.frame objects and you wanted to quickly rewrite it to use data.table instead. The approach taken above seems perfectly plausible.

Can anyone reproduce the same slowdown and high processor usage? Is it somehow fixable while keeping more or less the same subsetting process, or does the code have to be rewritten completely to use data.table efficiently?

PS: I just tested it on a Windows machine and the thread usage was normal (one thread running at 100%), but it was still slower. Testing on a system similar to mine gave the same results as above.
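For what it's worth, data.table's internal parallelism can be capped with setDTthreads(), which is one way to check whether the multi-threading itself contributes to the behavior (my suggestion, not part of the original test):

# Cap data.table at a single OpenMP thread; getDTthreads() reports the
# current setting. This changes only the thread count, not the algorithm.
data.table::setDTthreads(1)
data.table::getDTthreads()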

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=C           LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C        
 [5] LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=et_EE.UTF-8 LC_NAME=C           
 [9] LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
 [1] compiler_3.5.1   assertthat_0.2.0 cli_1.0.1        tools_3.5.1      pillar_1.3.1    
 [6] rstudioapi_0.9.0 tibble_2.0.0     crayon_1.3.4     utf8_1.1.4       fansi_0.4.0     
[11] pkgconfig_2.0.2  rlang_0.3.1   

EDIT:

Thanks to everybody for their comments. It appears the slowdown is related to the overhead of [.data.table, as detailed by @Hugh. The same issue was mentioned in "efficient subsetting of data.table with greater-than, less-than using indices", as pointed out by @denis.
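For illustration, here is a minimal sketch (my own, not from the comments) of how the loop can be kept while avoiding the per-iteration [.data.table overhead: hoist the columns out as plain vectors once, so each iteration only does cheap vectorised comparisons.

chr1 <- dt1$chr; start1 <- dt1$start; end1 <- dt1$end
chr2 <- dt2$chr; start2 <- dt2$start; end2 <- dt2$end

test3 <- integer(nrow(dt2))
system.time(
  for (i in seq_len(nrow(dt2))) {
    # plain vector scan; no [.data.table call inside the loop
    test3[i] <- sum(chr1 == chr2[i] & start1 >= start2[i] & end1 <= end2[i])
  })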

The fix proposed by @Frank, while really efficient and producing similar output, changes the behavior of the process by removing the loop completely and adds a possibly unwanted column to the original dataset.
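(If that helper column is unwanted, it can be removed again by reference; a one-liner, assuming the column is named v as in the answer below:)

dt2[, v := NULL]  # drop the added column by reference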

EDIT.1:

After my first edit, @Frank added another approach that computes a list column using data.table syntax. While it's pretty neat, I must admit I needed a while to figure out what is going on. I figured it's just computing lm() on the start and end columns of the subset data.table, so I tried reproducing the results using a for loop and data.frames. Timings:

> system.time({res <- dt1[dt2, on=.(chr, start >= start, end <= end), .(n = .N, my_lm = list(lm(x.start ~ x.end))), by=.EACHI][, .(n, my_lm)]; res <- as.list(res$my_lm)})
   user  system elapsed 
 11.538   0.003  11.336 
> 
> test_new <- list()
> system.time(
+   for (i in 1:20000) {
+     df_new <- df1[df1$chr == df2$chr[i] & df1$start >= df2$start[i] & df1$end <= df2$end[i],]
+     test_new[[i]] <- lm(df_new$start ~ df_new$end)
+   })
   user  system elapsed 
 12.377   0.048  12.425 
> 

As long as you have a bottleneck function like lm(), you are better off (for control and readability's sake) with a basic for loop, but using data.frames.
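As a quick sanity check (my addition, not in the original post), the fitted models from the two approaches can be compared; the coefficient names differ because the formulas differ, so compare unnamed values:

# first model from each approach; unname() because the coefficient names
# differ (x.start ~ x.end vs df_new$start ~ df_new$end)
all.equal(unname(coef(res[[1]])), unname(coef(test_new[[1]])))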

asked by MarChers
1 Answer

Can anyone reproduce the same slowdown and high processor usage? Is it somehow fixable while keeping more or less the same subsetting process, or does the code have to be rewritten completely to use data.table efficiently?

I get timings of 5 sec and 44 sec for the OP's two approaches (DF and DT, resp.), but...

system.time(
  dt2[, v := dt1[.SD, on=.(chr, start >= start, end <= end), .N, by=.EACHI]$N]
)
#    user  system elapsed 
#    0.03    0.01    0.03 
identical(dt2$v, unlist(test1))
# TRUE

But let's say for some reason you had a script using data.frame objects and you wanted to quickly rewrite it to use data.table instead. The approach taken above seems perfectly plausible.

This is pretty quick to write once you're used to data.table syntax.
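For readers newer to the syntax, here is the same expression with my annotations unpacking each part:

dt2[, v := dt1[.SD,                    # join: dt1 is x, each row of dt2 (.SD) is i
               on = .(chr,             # equi-join on chr
                      start >= start,  # non-equi: dt1$start >= dt2$start
                      end <= end),     # non-equi: dt1$end <= dt2$end
               .N,                     # count the matching dt1 rows...
               by = .EACHI]$N]         # ...separately for each row of i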


If you don't want to modify dt2 just take the vector directly...

res <- dt1[dt2, on=.(chr, start >= start, end <= end), .N, by=.EACHI]$N

For this example, a vector of row counts makes sense, but if you have a more complicated output that needs to be in a list, you can use a list column...

res <- dt1[dt2, on=.(chr, start >= start, end <= end), .(
  n = .N, 
  my_lm = list(lm(x.start ~ x.end))
), by=.EACHI][, .(n, my_lm)]

       n my_lm
    1: 1  <lm>
    2: 1  <lm>
    3: 1  <lm>
    4: 1  <lm>
    5: 1  <lm>
   ---        
19996: 2  <lm>
19997: 2  <lm>
19998: 2  <lm>
19999: 2  <lm>
20000: 2  <lm>
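A short usage note (mine, not from the answer): individual fits can be pulled back out of the list column like any other list element.

m1 <- res$my_lm[[1]]  # first fitted model
coef(m1)              # its coefficients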
answered by Frank