Pass expressions to function to evaluate within data.table to allow for internal optimisation

Tags: r, data.table

Preread

I went through some related material here on SO and, after getting a perfect answer to my previous problem, I am trying to get my head around, once and for all, how to deal canonically with data.tables inside functions.

Underlying Problem

What I eventually want is to create a function which takes some R expressions as inputs and evaluates them in the context of a data.table (both in the i as well as in the j part). The quoted answers tell me that I have to use some get/eval/substitute combination if my inputs become more complicated than just a single column (in which case I could live with the ..string or the with = FALSE approach [1]).

My real data is rather big, so I am concerned about computational time.

Ultimately, if I want full flexibility (that is, passing in expressions rather than bare column names), I understand that I have to go for an eval approach.

Code speaks a thousand words, so let me illustrate what I have found out so far:

Setup

library(data.table)
iris <- copy(iris)
setDT(iris)

Workhorse Function

my_fun <- function(my_i, my_j, option_sel = 1, my_data = iris, by = NULL) {
   switch(option_sel,
      {
         ## option 1 - base R deparse
         my_data[eval(parse(text = deparse(substitute(my_i)))), 
                 eval(parse(text = deparse(substitute(my_j)))),
                 by]
      },
      {
         ## option 2 - base R even shorter
         my_data[eval(substitute(my_i)), 
                 eval(substitute(my_j)),
                 by]

      },
      {
         ## option 3 - rlang
         my_data[rlang::eval_tidy(rlang::enexpr(my_i)),
                 rlang::eval_tidy(rlang::enexpr(my_j), data = .SD),
                 by]

      },
      {
         ## option 4 - if passing only simple column name strings
         ## we can use `with` (in j only)
         my_data[,
                 my_j, with = FALSE,
                 by]

      },
      {
         ## option 5 - if passing only simple column name strings 
         ## we can use ..syntax (in 'j' only)
         my_data[,
                 ..my_j]
                 # , by] ## would give a strange error

      },
      {
         ## option 6 - if passing only simple column name strings
         ## we can use `get`
         my_data[,
                 setNames(.(get(my_j)), my_j),
                 by]

      }
   )
}

Results

## added the unnecessary NULL to enforce same format
## did not want to make complicated ifs for by in the func 
## but by is needed for meaningful benchmarks later
expected <- iris[Species == "setosa", sum(Sepal.Length), NULL]
sapply(1:3, function(i) 
               isTRUE(all.equal(expected,
                                my_fun(Species == "setosa", sum(Sepal.Length), i))))
# [1] TRUE TRUE TRUE

expected <- iris[, .(Sepal.Length), NULL]
sapply(4:6, function(i)
               isTRUE(all.equal(expected,
                                my_fun(my_j = "Sepal.Length", option_sel = i))))
# [1] TRUE TRUE TRUE

Questions

All of the options work, but while creating this (admittedly not so) minimal example a few questions came up:

  1. To profit the most from data.table, I have to write code that the internal optimizer can inspect and, well, optimize [2]. So which of options 1-3 (options 4-6 are here only for completeness and lack full flexibility) works "best" with data.table, that is, which of them can be optimized internally to take full advantage of data.table? My quick benchmarks suggest that the rlang option is the fastest.
  2. I realized that for option 3 I have to provide .SD as the data argument in the j part, but not in the i part. That this is due to scoping is clear. But why does eval_tidy "see" the column names in i but not in j?
  3. Is there any other solution that can be optimized even further?
  4. Using by with option 5 results in a strange error. Why?
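
Regarding question 1, one way to look at what the optimizer does with a given call is the verbose = TRUE argument of [.data.table. The following is only a rough sketch under my own assumptions: the toy table dt, the expression j_expr and the two log variables are made up for illustration, and the code is skipped entirely if data.table is not installed. It captures the verbose log for a hand-typed j and for an eval()-wrapped j so the two can be compared.

```r
# Logs stay empty if data.table is not available.
verbose_direct <- character(0)
verbose_eval   <- character(0)

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  set.seed(1)
  # Small illustrative table (hypothetical, not the benchmark data).
  dt <- data.table(x = 1:1000, g = sample(10L, 1000, replace = TRUE))

  j_expr <- quote(max(x))  # hypothetical j expression passed around as a call

  # Hand-typed call: capture whatever data.table prints in verbose mode.
  verbose_direct <- capture.output(
    res_direct <- dt[, max(x), by = g, verbose = TRUE]
  )
  # Same computation, but j is wrapped in eval().
  verbose_eval <- capture.output(
    res_eval <- dt[, eval(j_expr), by = g, verbose = TRUE]
  )

  # Both variants should return the same result.
  stopifnot(isTRUE(all.equal(res_direct, res_eval)))

  # Show only the log lines that talk about optimization.
  print(grep("GForce|optimi", verbose_direct, value = TRUE))
  print(grep("GForce|optimi", verbose_eval, value = TRUE))
}
```

Comparing the two filtered logs shows whether both forms are reported as optimized in the same way. The exact wording of the log lines depends on the data.table version, so the grep pattern is a guess, not a stable interface.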

Benchmarks

library(dplyr)
size     <- c(1e6, 1e7, 1e8)
grp_prop <- c(1e-6, 1e-4)

make_bench_dat <- function(size, grp_prop) {
   data.table(x = seq_len(size),
              g = sample(ceiling(size * grp_prop), size, grp_prop < 1))
}

res <- bench::press(
   size = size,
   grp_prop = grp_prop,
   {
      bench_dat <- make_bench_dat(size, grp_prop)
      bench::mark(
         deparse    = my_fun(TRUE, max(x), 1, bench_dat, by = "g"),
         substitute = my_fun(TRUE, max(x), 2, bench_dat, by = "g"),
         rlang      = my_fun(TRUE, max(x), 3, bench_dat, by = "g"), 
         relative = TRUE)
   }
)

summary(res) %>% select(expression, size, grp_prop, min, median)
# # A tibble: 18 x 5
#    expression      size grp_prop      min   median
#    <bch:expr>     <dbl>    <dbl> <bch:tm> <bch:tm>
#  1 deparse      1000000 0.000001  22.73ms  24.36ms
#  2 substitute   1000000 0.000001  22.56ms   25.3ms
#  3 rlang        1000000 0.000001   8.09ms   9.05ms
#  4 deparse     10000000 0.000001 274.24ms 308.72ms
#  5 substitute  10000000 0.000001 276.73ms 276.99ms
#  6 rlang       10000000 0.000001 114.52ms 119.21ms
#  7 deparse    100000000 0.000001    3.79s    3.79s
#  8 substitute 100000000 0.000001    3.92s    3.92s
#  9 rlang      100000000 0.000001    3.12s    3.12s
# 10 deparse      1000000 0.0001    29.57ms  36.25ms
# 11 substitute   1000000 0.0001    37.22ms  41.56ms
# 12 rlang        1000000 0.0001     19.3ms  24.07ms
# 13 deparse     10000000 0.0001   386.13ms 396.84ms
# 14 substitute  10000000 0.0001   330.22ms 332.42ms
# 15 rlang       10000000 0.0001   270.54ms 274.35ms
# 16 deparse    100000000 0.0001      4.51s    4.51s
# 17 substitute 100000000 0.0001       4.1s     4.1s
# 18 rlang      100000000 0.0001      2.87s    2.87s

[1] with = FALSE or ..columnName does, however, work only in the j part.

[2] I learned that the hard way when I got a significant performance boost after replacing purrr::map with base::lapply.

thothal asked May 27 '20 09:05



1 Answer

No need for fancy tools; just use base R's metaprogramming features.

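my_fun2 = function(my_i, my_j, by, my_data) {
  # build the call my_data[i, j, by] with the caller's unevaluated expressions spliced in
  dtq = substitute(
    my_data[.i, .j, .by],
    list(.i=substitute(my_i), .j=substitute(my_j), .by=substitute(by))
  )
  print(dtq)  # show the constructed call
  eval(dtq)   # evaluate it as if it had been typed by hand
}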

my_fun2(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
my_fun2(my_j = "Sepal.Length", my_data=as.data.table(iris))

This way you can be sure that data.table will apply all possible optimizations, just as if you had typed the [ call by hand.
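
To make the substitute() step easier to see in isolation, here is a minimal base R sketch. The helper build_dt_call and the symbol dt are hypothetical names chosen for illustration. The helper only builds the call and never evaluates it, so it runs without data.table.

```r
# Hypothetical helper: splice the caller's unevaluated expressions into a
# template call and return the call itself (nothing is evaluated here).
build_dt_call <- function(my_i, my_j, by) {
  substitute(
    dt[.i, .j, by = .by],
    list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by))
  )
}

# The call as constructed programmatically ...
built <- build_dt_call(Species == "setosa", sum(Sepal.Length), Species)
# ... and the same call as it would be typed by hand.
typed <- quote(dt[Species == "setosa", sum(Sepal.Length), by = Species])

print(built)
identical(built, typed)
```

Because the constructed object is the same call that hand-typed code would produce, data.table has no way to tell the two apart when eval() runs it.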


Note that in data.table we are planning to make substitution easier; see the solution proposed in PR Rdatatable/data.table#4304.

Then, with the extra env argument, substitution will be handled internally for you:

my_fun3 = function(my_i, my_j, by, my_data) {
  my_data[.i, .j, .by, env=list(.i=substitute(my_i), .j=substitute(my_j), .by=substitute(by)), verbose=TRUE]
}
my_fun3(Species == "setosa", sum(Sepal.Length), my_data=as.data.table(iris))
#Argument 'j'  after substitute: sum(Sepal.Length)
#Argument 'i'  after substitute: Species == "setosa"
#...
my_fun3(my_j = "Sepal.Length", my_data=as.data.table(iris))
#Argument 'j'  after substitute: Sepal.Length
#...
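
As far as I know, the env argument has since been released in data.table 1.15.0. Under that assumption, the sketch below shows how env can also take plain character strings, which are converted to symbols before the call is evaluated. The toy table dt and the placeholder names fun, col and grp are made up for illustration, and the code is skipped if a recent enough data.table is not installed.

```r
res <- NULL  # stays NULL if the required data.table version is missing

if (requireNamespace("data.table", quietly = TRUE) &&
    utils::packageVersion("data.table") >= "1.15.0") {
  library(data.table)
  # Small illustrative table (hypothetical).
  dt <- data.table(g = c("a", "a", "b"), x = c(1, 5, 3))

  # fun, col and grp are placeholders; env maps them to max, x and g,
  # so the call is evaluated as dt[, .(out = max(x)), by = g].
  res <- dt[, .(out = fun(col)), by = grp,
            env = list(fun = "max", col = "x", grp = "g")]
  print(res)
}
```

This keeps the function body free of explicit substitute() and eval() calls while still handing data.table an ordinary-looking call.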
jangorecki answered Sep 18 '22 02:09
