Working with the data.table R package, I noticed really high processor usage when running a simple for loop that subsets a dataset using values from another data.table. When I say high usage, I mean 100% on all available threads for the entire time the loop is running.
The interesting part: using a data.frame object for the same process takes 10x less time for the same output, and with only one core at 100%.
Here is my hopefully reproducible example:
# toy interval data: chromosome, start and end positions
chr   <- c(rep(1, 1000), rep(2, 1000), rep(3, 1000), rep(3, 1000))
start <- rep(seq(from = 1, to = 100000, by = 100), 4)
end   <- start + 100
df1 <- data.frame(chr = chr, start = start, end = end)
df2 <- rbind(df1, df1, df1, df1, df1)   # 20000 rows
dt1 <- data.table::data.table(df1)
dt2 <- data.table::data.table(df2)
test1 <- list()
test2 <- list()
# loop subsetting a data.frame
system.time(
  for (i in 1:nrow(df2)) {
    no.dim <- dim(df1[df1$chr == df2[i, 'chr'] & df1$start >= df2[i, 'start'] & df1$end <= df2[i, 'end'], ])[1]
    test1[i] <- no.dim
  })

# loop subsetting a data.table using data.table syntax
system.time(
  for (i in 1:nrow(dt2)) {
    no.dim <- dim(dt1[chr == dt2[i, chr] & start >= dt2[i, start] & end <= dt2[i, end], ])[1]
    test2[i] <- no.dim
  })
# is the output the same
identical(test1, test2)
And this is the output:
> #loop subsetting a data.frame
> system.time(
+ for (i in 1:nrow(df2)) {
+ no.dim <- dim(df1[df1$chr == df2[i, 'chr'] & df1$start >= df2[i, 'start'] & df1$end <= df2[i, 'end'], ])[1]
+ test1[i] <- no.dim
+ })
user system elapsed
2.607 0.004 2.612
>
> # loop subsetting a data.table using data.table syntax
> system.time(
+ for (i in 1:nrow(dt2)) {
+ no.dim <- dim(dt1[chr == dt2[i, chr] & start >= dt2[i, start] & end <= dt2[i, end], ])[1]
+ test2[i] <- no.dim
+ })
user system elapsed
192.632 0.152 24.398
>
> # is the output the same
> identical(test1, test2)
[1] TRUE
Now, I know there are probably multiple better and more efficient ways to perform the same task, and that I am probably not doing it the data.table way. But let's say for some reason you had a script using data.frame objects and you wanted to quickly rewrite the thing to use data.table instead. The approach taken above seems perfectly plausible.
Can anyone reproduce the same slowdown and high processor usage? Is it somehow fixable while keeping more or less the same subsetting process, or does the code have to be rewritten completely to be used efficiently on data.tables?
PS: I just tested it on a Windows machine and the thread usage was normal (one thread running at 100%), but it was still slower. Testing it on a system similar to mine gave the same results as above.
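For what it's worth, data.table exposes getDTthreads()/setDTthreads(), so the threading angle can be tested directly by pinning the package to a single thread and rerunning the loop. A minimal sketch (timings left to the reader):

library(data.table)
getDTthreads()      # how many threads data.table is currently using
setDTthreads(1)     # pin data.table to a single thread
system.time(
  for (i in 1:nrow(dt2)) {
    no.dim <- dim(dt1[chr == dt2[i, chr] & start >= dt2[i, start] & end <= dt2[i, end], ])[1]
    test2[i] <- no.dim
  })
setDTthreads(0)     # 0 restores the default (all logical CPUs)

My sessionInfo():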
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
locale:
[1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C LC_COLLATE=C
[5] LC_MONETARY=C LC_MESSAGES=C LC_PAPER=et_EE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.0
loaded via a namespace (and not attached):
[1] compiler_3.5.1 assertthat_0.2.0 cli_1.0.1 tools_3.5.1 pillar_1.3.1
[6] rstudioapi_0.9.0 tibble_2.0.0 crayon_1.3.4 utf8_1.1.4 fansi_0.4.0
[11] pkgconfig_2.0.2 rlang_0.3.1
EDIT:
Thanks to everybody for their comments. It appears the slowdown is related to the per-call overhead of [.data.table, as detailed by @Hugh. The same issue was mentioned in "Efficient subsetting of data.table with greater-than, less-than using indices", as pointed out by @denis.
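To see that overhead in isolation, one can time a single tiny subset with each object; a rough sketch (the microbenchmark package is an extra dependency, not in my sessionInfo above):

library(microbenchmark)
# A single small subset: the data.frame version is dominated by the actual
# vector scan, while the data.table version also pays [.data.table's fixed
# per-call cost (argument handling, query optimisation), which is what
# multiplies up over 20000 loop iterations.
microbenchmark(
  df = df1[df1$chr == 1 & df1$start >= 1 & df1$end <= 101, ],
  dt = dt1[chr == 1 & start >= 1 & end <= 101],
  times = 100
)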
The fix proposed by @Frank, while really efficient and producing similar output, changes the behavior of the process by removing the loop completely and adding a possibly unwanted column to the original dataset.
EDIT.1:
After my first edit, @Frank added another approach which consists of computing a list column using data.table syntax. While it's pretty neat, I must admit that I needed a while to figure out what is going on. I figured it's just computing lm() on the start and end columns of the subset data.table, so I tried reproducing the results using a for loop and data.frames. Timings:
> system.time({res <- dt1[dt2, on=.(chr, start >= start, end <= end), .(n = .N, my_lm = list(lm(x.start ~ x.end))), by=.EACHI][, .(n, my_lm)]; res <- as.list(res$my_lm)})
user system elapsed
11.538 0.003 11.336
>
> test_new <- list()
> system.time(
+ for (i in 1:20000) {
+ df_new <- df1[df1$chr == df2$chr[i] & df1$start >= df2$start[i] & df1$end <= df2$end[i],]
+ test_new[[i]] <- lm(df_new$start ~ df_new$end)
+ })
user system elapsed
12.377 0.048 12.425
>
As long as you have a bottleneck function like lm(), you are better off (for control and readability's sake) with a basic for loop, but using data.frames.
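(One more squeeze if you do keep the loop: hoisting the columns out into plain vectors before looping avoids repeated data.frame subsetting altogether. A sketch, untimed here:)

# hypothetical variant: pull the columns out once, loop over atomic vectors
chr1 <- df1$chr; start1 <- df1$start; end1 <- df1$end
chr2 <- df2$chr; start2 <- df2$start; end2 <- df2$end
test_vec <- vector("list", nrow(df2))
for (i in seq_len(nrow(df2))) {
  keep <- chr1 == chr2[i] & start1 >= start2[i] & end1 <= end2[i]
  test_vec[[i]] <- lm(start1[keep] ~ end1[keep])
}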
Can anyone reproduce the same slowdown and high processor usage? Is it somehow fixable while keeping more or less the same subsetting process, or does the code have to be rewritten completely to be used efficiently on data.tables?
I get timings of 5 sec and 44 sec for the OP's two approaches (DF and DT, resp.), but...
system.time(
dt2[, v := dt1[.SD, on=.(chr, start >= start, end <= end), .N, by=.EACHI]$N]
)
# user system elapsed
# 0.03 0.01 0.03
identical(dt2$v, unlist(test1))
# TRUE
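In case the non-equi join syntax is unfamiliar, here is the same call annotated (this is just my reading of it, not new code to run):

dt2[, v := dt1[.SD,                    # join: dt1 is x, the rows of dt2 (.SD) are i
               on = .(chr,             # equality on chr, plus two range conditions:
                      start >= start,  # x.start >= i.start, i.e. dt1$start >= dt2$start
                      end <= end),     # x.end <= i.end, i.e. dt1$end <= dt2$end
               .N,                     # count the matching rows of dt1 ...
               by = .EACHI]$N]         # ... once per row of i, i.e. per row of dt2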
But let's say for some reason you had a script using data.frame objects and you wanted to quickly rewrite the thing to use data.table instead. The approach taken above seems perfectly plausible.
This is pretty quick to write once you're used to data.table syntax.
If you don't want to modify dt2, just take the vector directly...
res <- dt1[dt2, on=.(chr, start >= start, end <= end), .N, by=.EACHI]$N
For this example, a vector of row counts makes sense, but if you have a more complicated output that needs to be in a list, you can use a list column...
res <- dt1[dt2, on=.(chr, start >= start, end <= end), .(
n = .N,
my_lm = list(lm(x.start ~ x.end))
), by=.EACHI][, .(n, my_lm)]
n my_lm
1: 1 <lm>
2: 1 <lm>
3: 1 <lm>
4: 1 <lm>
5: 1 <lm>
---
19996: 2 <lm>
19997: 2 <lm>
19998: 2 <lm>
19999: 2 <lm>
20000: 2 <lm>
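The fitted models are then ordinary lm objects sitting in a list column, retrievable the usual way, e.g.:

res$my_lm[[1]]                        # the lm fitted for the first row of dt2
summary(res$my_lm[[1]])$coefficients  # its coefficient table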