I am exploring different ways to wrap an aggregation function (though really it could be any type of function) using data.table (one dplyr example is also provided) and was wondering about best practices for functional programming / metaprogramming with respect to code clarity and performance.
The basic task is to aggregate a table flexibly, i.e. to parameterize the variables to aggregate, the dimensions to aggregate by, the resulting names of both, and the aggregation function. I have implemented (nearly) the same function in three data.table ways and one dplyr way:
libraries
library(data.table)
library(dplyr)
data
n_size <- 1*10^6
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, replace = TRUE)
sample_dimensions <- sample(letters[10:12], n_size, replace = TRUE)
df <-
  data.frame(
    a = sample_metrics,
    b = sample_metrics,
    c = sample_dimensions,
    d = sample_dimensions,
    x = sample_metrics,
    y = sample_dimensions,
    stringsAsFactors = FALSE)
dt <- as.data.table(df)
implementations
1. fn_dt_agg1
fn_dt_agg1 <-
  function(dt, metric, metric_name, dimension, dimension_name) {
    # aggregate each metric column with sum() and rename via setNames();
    # note that dimension_name is accepted but not actually used here
    temp <- dt[, setNames(lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
  }
res_dt1 <-
fn_dt_agg1(
dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
dimension = c("c", "d"), dimension_name = c("c", "d"))
2. fn_dt_agg2
fn_dt_agg2 <-
  function(dt, metric, metric_name, dimension, dimension_name,
           agg_type) {
    # build the j expression as a call object of the form
    # .(metric_name1 = agg_type(metric1, na.rm = TRUE), ...)
    j_call <- as.call(c(
      as.name("."),
      sapply(setNames(metric, metric_name),
             function(var) as.call(list(as.name(agg_type),
                                        as.name(var), na.rm = TRUE)),
             simplify = FALSE)
    ))
    dt[, eval(j_call), keyby = dimension][]
  }
res_dt2 <-
fn_dt_agg2(
dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
dimension = c("c", "d"), dimension_name = c("c", "d"),
agg_type = c("sum"))
all.equal(res_dt1, res_dt2)
#TRUE
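For reference, the j expression that fn_dt_agg2 constructs can be inspected outside the function (a small illustration using the parameters from above):
j_call <- as.call(c(
  as.name("."),
  sapply(setNames(c("a", "b"), c("a", "b")),
         function(var) as.call(list(as.name("sum"),
                                    as.name(var), na.rm = TRUE)),
         simplify = FALSE)
))
j_call
# .(a = sum(a, na.rm = TRUE), b = sum(b, na.rm = TRUE))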
3. fn_dt_agg3
fn_dt_agg3 <-
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {
    # build the aggregation function from its name via eval(parse())
    e <- eval(parse(text = paste0("function(x) {",
                                  agg_type, "(x, na.rm = TRUE)}")))
    temp <- dt[, setNames(lapply(.SD, e),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
  }
res_dt3 <-
fn_dt_agg3(
dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
dimension = c("c", "d"), dimension_name = c("c", "d"),
agg_type = "sum")
all.equal(res_dt1, res_dt3)
#TRUE
4. fn_df_agg1
fn_df_agg1 <-
  function(df, metric, metric_name, dimension, dimension_name, agg_type) {
    all_vars <- c(dimension, metric)
    all_vars_new <- c(dimension_name, metric_name)
    dots_group <- lapply(dimension, as.name)
    e <- eval(parse(text = paste0("function(x) {",
                                  agg_type, "(x, na.rm = TRUE)}")))
    # note: the underscore verbs (select_ etc.) and funs() are deprecated
    # in later dplyr releases; see the tidy-evaluation sketch below
    df %>%
      select_(.dots = all_vars) %>%
      group_by_(.dots = dots_group) %>%
      summarise_each_(funs(e), metric) %>%
      rename_(.dots = setNames(all_vars, all_vars_new))
  }
res_df1 <-
fn_df_agg1(
df = df, metric = c("a", "b"), metric_name = c("a", "b"),
dimension = c("c", "d"), dimension_name = c("c", "d"),
agg_type = "sum")
all.equal(res_dt1, as.data.table(res_df1))
#"Datasets has different keys. 'target': c, d. 'current' has no key."
benchmarking
Out of curiosity, and for my future self and other interested parties, I ran a benchmark of all four implementations, which may already shed some light on the performance question (although I'm not a benchmarking expert, so please excuse any departures from accepted best practices). I expected fn_dt_agg1 to be the fastest since it takes one parameter less (the aggregation function), but that doesn't seem to have a sizable impact. I was also surprised by the relatively slow dplyr function, though this may be due to a poor design choice on my end.
library(microbenchmark)
bench_res <-
microbenchmark(
fn_dt_agg1 =
fn_dt_agg1(
dt = dt, metric = c("a", "b"),
metric_name = c("a", "b"),
dimension = c("c", "d"),
dimension_name = c("c", "d")),
fn_dt_agg2 =
fn_dt_agg2(
dt = dt, metric = c("a", "b"),
metric_name = c("a", "b"),
dimension = c("c", "d"),
dimension_name = c("c", "d"),
agg_type = c("sum")),
fn_dt_agg3 =
fn_dt_agg3(
dt = dt, metric = c("a", "b"),
metric_name = c("a", "b"),
dimension = c("c", "d"),
dimension_name = c("c", "d"),
agg_type = c("sum")),
fn_df_agg1 =
fn_df_agg1(
df = df, metric = c("a", "b"), metric_name = c("a", "b"),
dimension = c("c", "d"), dimension_name = c("c", "d"),
agg_type = "sum"),
times = 100L)
bench_res
# Unit: milliseconds
# expr min lq mean median uq max neval
# fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975 100
# fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064 84.56375 100
# fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715 100
# fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838 100
answer
I don't recommend eval(parse()). You can achieve the same as in approach three without it:
fn_dt_agg4 <-
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {
    # look up the aggregation function by name instead of parsing code
    e <- function(x) getFunction(agg_type)(x, na.rm = TRUE)
    temp <- dt[, setNames(lapply(.SD, e),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
  }
This also avoids some security risks.
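Another option in the same spirit (not from the answer above, just a sketch) is to pass the aggregation function itself rather than its name; fn_dt_agg5 and agg_fun are names I've made up:
fn_dt_agg5 <-
  function(dt, metric, metric_name, dimension, agg_fun = sum) {
    # agg_fun is a function object, so no name lookup is needed at all;
    # lapply() forwards na.rm = TRUE to each call of agg_fun
    temp <- dt[, setNames(lapply(.SD, agg_fun, na.rm = TRUE),
                          metric_name),
               keyby = dimension, .SDcols = metric]
    temp[]
  }
res_dt5 <-
  fn_dt_agg5(dt, metric = c("a", "b"), metric_name = c("a", "b"),
             dimension = c("c", "d"), agg_fun = sum)
all.equal(res_dt1, res_dt5)
#TRUE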
PS: You can check what data.table is doing regarding optimizations by setting options("datatable.verbose" = TRUE).
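For example (the exact messages vary by data.table version):
options("datatable.verbose" = TRUE)
res_v <-
  fn_dt_agg1(dt, metric = c("a", "b"), metric_name = c("a", "b"),
             dimension = c("c", "d"), dimension_name = c("c", "d"))
# the console output reports, among other things, whether the j
# expression was optimized internally (look for mentions of GForce)
options("datatable.verbose" = FALSE)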