Performance benefits of chaining over ANDing when filtering a data table

Tags: r, data.table

I'm in the habit of clumping similar tasks together into a single line. For example, if I need to filter on a, b, and c in a data.table, I'll put the conditions together in a single [] call joined with &. Yesterday, I noticed that in my particular case this was incredibly slow, so I tested chaining the filters instead. I've included an example below.

First, I seed the random number generator, load data.table, and create a dummy data set.

# Set RNG seed
set.seed(-1)

# Load libraries
library(data.table)

# Create data table
dt <- data.table(a = sample(1:1000, 1e7, replace = TRUE),
                 b = sample(1:1000, 1e7, replace = TRUE),
                 c = sample(1:1000, 1e7, replace = TRUE),
                 d = runif(1e7))

Next, I define my methods. The first approach chains filters together. The second ANDs the filters together.

# Chaining method
chain_filter <- function(){
  dt[a %between% c(1, 10)
     ][b %between% c(100, 110)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 10) & b %between% c(100, 110) & c %between% c(750, 760)]
}

Here, I check that the two methods give the same result.

# Check both give same result
identical(chain_filter(), and_filter())
#> [1] TRUE

Finally, I benchmark them.

# Benchmark
microbenchmark::microbenchmark(chain_filter(), and_filter())
#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq       max
#>  chain_filter() 25.17734  31.24489  39.44092  37.53919  43.51588  78.12492
#>    and_filter() 92.66411 112.06136 130.92834 127.64009 149.17320 206.61777
#>  neval cld
#>    100  a 
#>    100   b

Created on 2019-10-25 by the reprex package (v0.3.0)

In this case, chaining reduces run time by about 70%. Why is that? I mean, what's going on under the hood in data.table? I haven't seen any warnings against using &, so I was surprised that the difference is so big. In both cases they evaluate the same conditions, so that shouldn't make a difference. If anything, I expected the AND case to win: & is a quick operator, and then the data.table only has to be filtered once (using the logical vector resulting from the &s), as opposed to being filtered three times in the chaining case.

Bonus question

Does this principle hold for data.table operations in general? Is modularising tasks always a better strategy?

Asked Oct 25 '19 by Lyngbakr



1 Answer

Mostly, the answer was already given in the comments: the "chaining method" is faster here than the "ANDing method" because chaining runs the conditions one after another. As each step reduces the size of the data.table, there is less for the next condition to evaluate. ANDing evaluates every condition against the full-size data.
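A rough way to see this (my own sketch, not part of the original answer) is to count how many rows each condition has to touch in the two approaches:

# Sketch: approximate number of rows each condition is evaluated on.
# Exact counts depend on the seed; the orders of magnitude are what matter.
n_full     <- nrow(dt)                                   # ~1e7 rows
n_after_a  <- nrow(dt[a %between% c(1, 10)])             # ~1e5 rows (about 1%)
n_after_ab <- nrow(dt[a %between% c(1, 10)
                      ][b %between% c(100, 110)])        # ~1e3 rows

# Chaining evaluates one condition on each of these shrinking tables...
chain_rows <- n_full + n_after_a + n_after_ab
# ...while ANDing evaluates all three conditions on the full table.
and_rows <- 3 * n_full

c(chain = chain_rows, and = and_rows)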

We can demonstrate this with an example in which the individual steps do NOT decrease the size of the data.table (i.e. the conditions to check are the same for both approaches):

chain_filter <- function(){
  dt[a %between% c(1, 1000) # runs evaluation but does not filter out cases
     ][b %between% c(1, 1000)
       ][c %between% c(750, 760)]
}

# Anding method
and_filter <- function(){
  dt[a %between% c(1, 1000) & b %between% c(1, 1000) & c %between% c(750, 760)]
}

Using the same data, but with the bench package, which automatically checks that the results are identical:

res <- bench::mark(
  chain = chain_filter(),
  and = and_filter()
)
summary(res)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 chain         299ms    307ms      3.26     691MB     9.78
#> 2 and           123ms    142ms      7.18     231MB     5.39
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 chain       2.43   2.16      1         2.99     1.82
#> 2 and         1      1         2.20      1        1

As you can see, the ANDing approach is about 2.4 times faster in this case. That means chaining actually adds some overhead, which suggests that ANDing should usually be quicker, EXCEPT when the conditions reduce the size of the data.table step by step. Theoretically, the chaining approach could even be slower (overhead aside) if a condition were to increase the size of the data. In practice, though, I don't think that is possible, since recycling of logical vectors is not allowed in data.table. I think this answers your bonus question.
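That last point relies on data.table refusing to recycle logical vectors in i, so a filtering step cannot silently grow the result. A quick way to check this on your installed version (my own sketch, not part of the original answer):

# Quick check: data.table does not recycle a logical vector in i that is
# shorter than the number of rows, so a condition cannot enlarge the result.
library(data.table)
small_dt <- data.table(x = 1:4)
try(small_dt[c(TRUE, FALSE)])  # recent data.table versions raise an error here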

For comparison, the original functions from the question, run on my machine with bench:
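(The _original definitions are not shown in the answer; presumably they are simply the question's functions under new names, along these lines:)

# Presumed definitions: the question's original functions, renamed so they
# can coexist with the modified versions above.
chain_filter_original <- function(){
  dt[a %between% c(1, 10)
     ][b %between% c(100, 110)
       ][c %between% c(750, 760)]
}

and_filter_original <- function(){
  dt[a %between% c(1, 10) & b %between% c(100, 110) & c %between% c(750, 760)]
}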

res <- bench::mark(
  chain = chain_filter_original(),
  and = and_filter_original()
)
summary(res)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 chain        29.6ms   30.2ms     28.5     79.5MB     7.60
#> 2 and         125.5ms  136.7ms      7.32   228.9MB     7.32
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 chain       1      1         3.89      1        1.04
#> 2 and         4.25   4.52      1         2.88     1
Answered by JBGruber