
data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]

Tags: r, data.table

I want to write a function that takes the column name and the name of the data.table as character strings.

This function does what I want, and it works:

library(data.table)

n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
dataName = "d"
colName = "x"

# Objective :
FOO <- function(dataName = "d",
         colName = "x"){
  get(dataName)[, mean(get(colName)), by = grp]
}

The problem is that evaluating get() for every group is very slow. On a real data example it takes 14 times longer than the equivalent with static names. I would like to reach the same execution time as if the column names were static.

What I tried:

(cl <- substitute(mean(eval(parse(text = colName))), list(colName = as.name(colName))))

microbenchmark::microbenchmark(

  # 1) works and quick but does not use variable names of columns (654ms)
  (t1 <- d[, mean(x), by = grp]),

  # 2) works but slow (1006ms)
  (t2 <- get(dataName)[, mean(get(colName)), by = grp]),

  # 3) works but slow (4075ms)
  (t3 <- eval(parse(text = dataName))[, mean(eval(parse(text = colName))), by = grp]),

  # 4) works but very slow (37202ms)
  (t4 <- get(dataName)[, eval(cl), by = grp]),

  # 5) double-dot syntax: does not work, I have not figured out how to use it
  # (t5 <- get(dataName)[, mean(..colName), by = grp]),

  times = 10)

Is the double-dot syntax appropriate here? Why is 4) so slow? I took it from this post, where it was the best option. I adapted the double-dot syntax from this post.

Thanks a lot for your help!

Samuel Allain asked Oct 25 '25 05:10



1 Answer

It would be better to pass the data.table d itself to FOO rather than the character string "d". You can also combine lapply with .SD and .SDcols instead of calling mean(get(colName)), so that data.table can apply its internal optimization of grouped means (GForce).

FOO2 <- function(dataName = d, colName = "x") { # d itself, not "d", is passed as the first argument
  dataName[, lapply(.SD, mean), by=grp, .SDcols=colName]
}

Benchmark: FOO vs FOO2

library(data.table)

set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))

microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times=5L
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893     5
 FOO2()  255.0828  267.1322  297.0389  275.4467  281.9873  405.5456     5
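
On why 4) is so slow: the substitution in cl replaces colName inside parse(text = ...), so cl ends up as mean(eval(parse(text = x))). That asks R to parse the values of column x as source code in every group, which likely accounts for the slowdown, and it is worth checking that t4 really contains the group means. If you prefer to build the call yourself rather than use .SD, a minimal sketch is to splice the column symbol in with bquote() before data.table sees the expression, so that j is literally mean(x). FOO3 and the small table below are hypothetical names used only for illustration, and I have not benchmarked this variant on the 1e7-row data.

```r
library(data.table)

# FOO3 is a hypothetical name; like FOO2 it takes the data.table itself.
FOO3 <- function(dt, colName = "x") {
  # bquote() replaces .(as.name(colName)) with the symbol x, so the call
  # that gets evaluated is dt[, mean(x), by = grp].
  eval(bquote(dt[, mean(.(as.name(colName))), by = grp]))
}

# Small illustrative data set (an assumption, not the benchmark data).
set.seed(1)
small <- data.table(x = 1:20, grp = sample(1:4, 20, replace = TRUE))

res <- FOO3(small, "x")              # column name given as a string
ref <- small[, mean(x), by = grp]    # static-name equivalent
all.equal(res, ref)
```

Because the expression is rewritten once, outside the grouping loop, data.table receives the same j as in the static version 1).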
B. Christian Kamgang answered Oct 26 '25 18:10



