I'm in the habit of clumping similar tasks together into a single line. For example, if I need to filter on a, b, and c in a data table, I'll put them together in one [...] with ANDs. Yesterday, I noticed that in my particular case this was incredibly slow, so I tested chaining filters instead. I've included an example below.
First, I seed the random number generator, load data.table, and create a dummy data set.
# Set RNG seed
set.seed(-1)
# Load libraries
library(data.table)
# Create data table
dt <- data.table(a = sample(1:1000, 1e7, replace = TRUE),
                 b = sample(1:1000, 1e7, replace = TRUE),
                 c = sample(1:1000, 1e7, replace = TRUE),
                 d = runif(1e7))
Next, I define my methods. The first approach chains filters together. The second ANDs the filters together.
# Chaining method
chain_filter <- function(){
  dt[a %between% c(1, 10)
     ][b %between% c(100, 110)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 10) & b %between% c(100, 110) & c %between% c(750, 760)]
}
Here, I check they give the same results.
# Check both give same result
identical(chain_filter(), and_filter())
#> [1] TRUE
Finally, I benchmark them.
# Benchmark
microbenchmark::microbenchmark(chain_filter(), and_filter())
#> Unit: milliseconds
#>            expr       min        lq      mean    median        uq       max neval cld
#>  chain_filter()  25.17734  31.24489  39.44092  37.53919  43.51588  78.12492   100  a
#>    and_filter()  92.66411 112.06136 130.92834 127.64009 149.17320 206.61777   100   b
Created on 2019-10-25 by the reprex package (v0.3.0)
In this case, chaining reduces run time by about 70%. Why is this the case? I mean, what's going on under the hood in data.table? I haven't seen any warnings against using &, so I was surprised that the difference is so big. In both cases they evaluate the same conditions, so that shouldn't be the difference. If anything, I expected the AND case to win: & is a quick vectorised operator, and the data table then only has to be filtered once (i.e., using the logical vector resulting from the ANDs), as opposed to three times in the chaining case.
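To make that expectation concrete, here is a rough sketch of the subsetting pattern each approach implies (just the pattern, not a claim about data.table internals; it reuses the dt defined above):

# ANDing: build one full-length logical vector, then subset once
keep <- dt$a %between% c(1, 10) &
  dt$b %between% c(100, 110) &
  dt$c %between% c(750, 760)  # each condition scans all 1e7 rows
res_and <- dt[keep]

# Chaining: each [ materialises an intermediate data.table, but later
# conditions only run on the rows that survived the earlier steps
tmp1 <- dt[a %between% c(1, 10)]       # scans 1e7 rows
tmp2 <- tmp1[b %between% c(100, 110)]  # scans only nrow(tmp1) rows
res_chain <- tmp2[c %between% c(750, 760)]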
Does this principle hold for data.table operations in general? Is modularising tasks always a better strategy?
Mostly, the answer was given in the comments already: the "chaining method" for data.table is faster in this case than the "anding method" because the chaining runs the conditions one after another. As each step reduces the size of the data.table, there is less to evaluate for the next one. "Anding" evaluates the conditions against the full-size data each time.
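To see the shrinkage directly, count the rows after each chained step (a minimal sketch using the dt from the question; exact counts depend on the RNG seed):

nrow(dt)                          # 10,000,000 rows to start
nrow(dt[a %between% c(1, 10)])    # ~100,000 rows (about 1% survive)
nrow(dt[a %between% c(1, 10)
        ][b %between% c(100, 110)])  # ~1,100 rows left for the last condition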
We can demonstrate this with an example in which the individual steps do NOT decrease the size of the data.table (i.e. the conditions to check are the same for both approaches):
# Chaining method
chain_filter <- function(){
  dt[a %between% c(1, 1000)  # evaluates the condition but filters nothing out
     ][b %between% c(1, 1000)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 1000) & b %between% c(1, 1000) & c %between% c(750, 760)]
}
Using the same data but the bench package, which automatically checks whether the results are identical:
res <- bench::mark(
  chain = chain_filter(),
  and = and_filter()
)
summary(res)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 chain         299ms    307ms      3.26     691MB     9.78
#> 2 and           123ms    142ms      7.18     231MB     5.39
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 chain       2.43   2.16      1         2.99     1.82
#> 2 and         1      1         2.20      1        1
As you can see here, the anding approach is 2.43 times faster in this case. That means chaining actually adds some overhead, suggesting that anding should usually be quicker, EXCEPT when the conditions reduce the size of the data.table step by step. Theoretically, the chaining approach could even be slower (even leaving the overhead aside), namely if a condition were to increase the size of the data. But practically I think that's not possible, since recycling of logical vectors is not allowed in data.table. I think this answers your bonus question.
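To illustrate that last point (a minimal sketch; recent data.table versions throw an error here, though the exact message may differ on your installation):

library(data.table)
small_dt <- data.table(x = 1:4)
# A logical i shorter than nrow() is not recycled, so a condition inside
# [ can never select more rows than it was handed:
try(small_dt[c(TRUE, FALSE)])  # errors instead of recycling to 4 rows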
For comparison, the original functions on my machine with bench:
res <- bench::mark(
  chain = chain_filter_original(),
  and = and_filter_original()
)
summary(res)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 chain        29.6ms   30.2ms     28.5      79.5MB     7.60
#> 2 and         125.5ms  136.7ms      7.32    228.9MB     7.32
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 chain       1      1         3.89      1        1.04
#> 2 and         4.25   4.52      1         2.88     1
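One practical consequence of all this (a heuristic on my part, not anything data.table enforces): if you do chain, put the condition you expect to be most selective first, so every later step scans as few rows as possible. A hypothetical reordering, assuming the filter on c discarded the most rows:

chain_filter_ordered <- function(){
  dt[c %between% c(750, 760)   # assumed most selective, runs on all rows
     ][a %between% c(1, 10)    # now runs on far fewer rows
       ][b %between% c(100, 110)]
}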