data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]

Question

I want to create a function which uses variable names of columns and variable name of data.

This function is what I want and it works :

n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = T))
dataName = "d"
colName = "x"

# Objective :
FOO <- function(dataName = "d",
         colName = "x"){
  get(dataName)[, mean(get(colName)), by = grp]
}

The problem is that evaluation of get() for each group is very time-consuming. On a real data example it is 14 times longer than the static-name equivalent. I would like to reach the same execution time as if the column names were static.

What I tried :

(cl <- substitute(mean(eval(parse(text = colName))), list(colName = as.name(colName))))

microbenchmark::microbenchmark(

  # 1) works and quick but does not use variable names of columns (654ms)
  (t1 <- d[, mean(x), by = grp]),

  # 2) works but slow (1006ms)
  (t2 <- get(dataName)[, mean(get(colName)), by = grp]), # works but slow

  # 3) works but slow (4075ms)
  (t3 <- eval(parse(text = dataName))[, mean(eval(parse(text = colName))), by = grp]),

  # 4) works but very slow (37202ms)
  (t4 <- get(dataName)[, eval(cl), by = grp]),

  # 5) double dot syntax doesn't work cause I don't master it
  # (t5 <- get(dataName)[, mean(..colName), by = grp]),

  times = 10)

Is the double dot syntax appropriate here ? Why is 4) so slow ? I took it from this post where it was the best option. I adapted the double dot syntax from this post.

Thanks a lot for your help !

B. Christian Kamgang · Accepted Answer

It would be better to pass the dataset name d to the FOO function instead of passing the character string "d". Also, you can use lapply combined with .SD so that you can benefit from internal optimization instead of using mean(get(colName)).

FOO2 = function(dataName=d, colName = "x") { # d instead of "d" passed to the first argument!
  dataName[, lapply(.SD, mean), by=grp, .SDcols=colName]
}

Benchmark: `FOO` vs `FOO2`

set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = T))

microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times=5L
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893     5
 FOO2()  255.0828  267.1322  297.0389  275.4467  281.9873  405.5456     5

data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]

Tags:

r

data.table

Samuel Allain

1 Answers

Benchmark: `FOO` vs `FOO2`

B. Christian Kamgang

Recent Activity

Donate For Us

data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]

Tags:

r

data.table

Samuel Allain

1 Answers

Benchmark: FOO vs FOO2

B. Christian Kamgang

Related questions

Recent Activity

Donate For Us

Benchmark: `FOO` vs `FOO2`