Preread
I went through some related material here on SO, and after getting a perfect answer to my previous question, I am trying to once and for all get my head around how to canonically deal with data.tables inside functions.
Underlying Problem
What I eventually want is a function which takes some R expressions as inputs and evaluates them in the context of a data.table (in both the i and the j part). The quoted answers tell me that I have to use some get/eval/substitute combination once my inputs become more complicated than a single column name (in which case I could live with the ..string or the with = FALSE approach [1]).
My real data is rather big, so I am concerned about computational time.
Ultimately, if I want full flexibility (that is, passing in expressions rather than bare column names), I understand that I have to go for an eval approach:
Code speaks a thousand words, so let me illustrate what I have found out so far:
Setup
library(data.table)
iris <- copy(iris)
setDT(iris)
Workhorse Function
my_fun <- function(my_i, my_j, option_sel = 1, my_data = iris, by = NULL) {
  switch(option_sel,
         {
           ## option 1 - base R deparse
           my_data[eval(parse(text = deparse(substitute(my_i)))),
                   eval(parse(text = deparse(substitute(my_j)))),
                   by]
         },
         {
           ## option 2 - base R, even shorter
           my_data[eval(substitute(my_i)),
                   eval(substitute(my_j)),
                   by]
         },
         {
           ## option 3 - rlang
           my_data[rlang::eval_tidy(rlang::enexpr(my_i)),
                   rlang::eval_tidy(rlang::enexpr(my_j), data = .SD),
                   by]
         },
         {
           ## option 4 - if passing only simple column name strings
           ## we can use `with = FALSE` (in j only)
           my_data[,
                   my_j, with = FALSE,
                   by]
         },
         {
           ## option 5 - if passing only simple column name strings
           ## we can use the ..syntax (in j only)
           my_data[,
                   ..my_j]
           # , by] ## would give a strange error
         },
         {
           ## option 6 - if passing only simple column name strings
           ## we can use `get`
           my_data[,
                   setNames(.(get(my_j)), my_j),
                   by]
         }
  )
}
Results
## added the unnecessary NULL to enforce same format
## did not want to make complicated ifs for by in the func
## but by is needed for meaningful benchmarks later
expected <- iris[Species == "setosa", sum(Sepal.Length), NULL]
sapply(1:3, function(i)
  isTRUE(all.equal(expected,
                   my_fun(Species == "setosa", sum(Sepal.Length), i))))
# [1] TRUE TRUE TRUE
expected <- iris[, .(Sepal.Length), NULL]
sapply(4:6, function(i)
  isTRUE(all.equal(expected,
                   my_fun(my_j = "Sepal.Length", option_sel = i))))
# [1] TRUE TRUE TRUE
Questions
All of the options work, but while creating this (admittedly not so) minimal example, a couple of questions came up:
1. To take full benefit from data.table, I have to use code which the internal optimizer can profile and, well, optimize [2]. So which of options 1-3 (4-6 are only here for completeness and lack full flexibility) works "best" with data.table, that is, which of them can be internally optimized? My quick benchmarks showed that the rlang option seems to be the fastest.
2. In option 3, I had to pass .SD as the data argument to eval_tidy in the j part, but not in the i part. This is due to scoping, that much is clear. But why does eval_tidy "see" the column names in i but not in j?
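To make question 2 concrete, here is a stripped-down sketch of the asymmetry (a minimal example of my own; the comments describe the observed behaviour, the "why" is exactly what I am asking):

library(rlang)
dt <- data.table(a = 1:3)

f_i <- function(e) {
  ex <- enexpr(e)
  ## works: the i expression is evaluated with all of dt's columns
  ## in scope, so eval_tidy() finds 'a' without any data argument
  dt[eval_tidy(ex)]
}
f_i(a > 1)

f_j <- function(e) {
  ex <- enexpr(e)
  ## dt[, eval_tidy(ex)] fails with "object 'a' not found";
  ## it only works once .SD explicitly supplies the columns
  dt[, eval_tidy(ex, data = .SD)]
}
f_j(sum(a))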
Benchmarks
library(dplyr) ## for select() below
size <- c(1e6, 1e7, 1e8)
grp_prop <- c(1e-6, 1e-4)
make_bench_dat <- function(size, grp_prop) {
  data.table(x = seq_len(size),
             g = sample(ceiling(size * grp_prop), size, replace = grp_prop < 1))
}
res <- bench::press(
  size = size,
  grp_prop = grp_prop,
  {
    bench_dat <- make_bench_dat(size, grp_prop)
    bench::mark(
      deparse = my_fun(TRUE, max(x), 1, bench_dat, by = "g"),
      substitute = my_fun(TRUE, max(x), 2, bench_dat, by = "g"),
      rlang = my_fun(TRUE, max(x), 3, bench_dat, by = "g"),
      relative = TRUE
    )
  }
)
summary(res) %>% select(expression, size, grp_prop, min, median)
# # A tibble: 18 x 5
# expression size grp_prop min median
# <bch:expr> <dbl> <dbl> <bch:tm> <bch:tm>
# 1 deparse 1000000 0.000001 22.73ms 24.36ms
# 2 substitute 1000000 0.000001 22.56ms 25.3ms
# 3 rlang 1000000 0.000001 8.09ms 9.05ms
# 4 deparse 10000000 0.000001 274.24ms 308.72ms
# 5 substitute 10000000 0.000001 276.73ms 276.99ms
# 6 rlang 10000000 0.000001 114.52ms 119.21ms
# 7 deparse 100000000 0.000001 3.79s 3.79s
# 8 substitute 100000000 0.000001 3.92s 3.92s
# 9 rlang 100000000 0.000001 3.12s 3.12s
# 10 deparse 1000000 0.0001 29.57ms 36.25ms
# 11 substitute 1000000 0.0001 37.22ms 41.56ms
# 12 rlang 1000000 0.0001 19.3ms 24.07ms
# 13 deparse 10000000 0.0001 386.13ms 396.84ms
# 14 substitute 10000000 0.0001 330.22ms 332.42ms
# 15 rlang 10000000 0.0001 270.54ms 274.35ms
# 16 deparse 100000000 0.0001 4.51s 4.51s
# 17 substitute 100000000 0.0001 4.1s 4.1s
# 18 rlang 100000000 0.0001 2.87s 2.87s
[1] with = FALSE or ..columnName does, however, only work in the j part.
[2] I learned that the hard way when I got a significant performance boost after replacing purrr::map with base::lapply.
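A quick way to check whether the internal optimizer picked up a given j expression is the verbose argument; a minimal check (my addition; the exact messages may differ across versions):

gf_dat <- data.table(g = rep(1:2, each = 3), x = 1:6)
gf_dat[, max(x), by = g, verbose = TRUE]
## the verbose output reports something like: GForce optimized j to 'gmax(x)'
gf_dat[, base::max(x), by = g, verbose = TRUE]
## a j the optimizer does not recognize is reported as left unchanged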
Answer
No need for fancy tools; just use base R metaprogramming features.
my_fun2 = function(my_i, my_j, by, my_data) {
  ## substitute the captured i/j/by expressions into a template of the
  ## full data.table query, then print and evaluate the complete call
  dtq = substitute(
    my_data[.i, .j, .by],
    list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by))
  )
  print(dtq)
  eval(dtq)
}
my_fun2(Species == "setosa", sum(Sepal.Length), my_data = as.data.table(iris))
my_fun2(my_j = "Sepal.Length", my_data = as.data.table(iris))
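Since by is substituted like i and j, grouped calls go through the same path; for example (my addition for illustration):

my_fun2(my_j = sum(Sepal.Length), by = Species, my_data = as.data.table(iris))
## prints something like: my_data[, sum(Sepal.Length), Species]
## and returns one sum per Species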
This way you can be sure that data.table will use all possible optimizations, just as when typing the [...] call by hand.
Note that in data.table we are planning to make substitution easier; see the solution proposed in PR Rdatatable/data.table#4304.
Then, using the extra env argument, the substitution will be handled internally for you:
my_fun3 = function(my_i, my_j, by, my_data) {
  my_data[.i, .j, .by,
          env = list(.i = substitute(my_i), .j = substitute(my_j), .by = substitute(by)),
          verbose = TRUE]
}
my_fun3(Species == "setosa", sum(Sepal.Length), my_data = as.data.table(iris))
# Argument 'j' after substitute: sum(Sepal.Length)
# Argument 'i' after substitute: Species == "setosa"
# ...
my_fun3(my_j = "Sepal.Length", my_data = as.data.table(iris))
# Argument 'j' after substitute: Sepal.Length
# ...
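Grouping works through env as well; for example (my addition; this requires a data.table version that implements the env argument proposed in the PR above):

my_fun3(my_j = sum(Sepal.Length), by = Species, my_data = as.data.table(iris))
## returns one sum per Species, with the substituted arguments
## again shown in the verbose output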