Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Evaluating both column name and the target value within `j` expression within `data.table`

Tags:

r

data.table

Consider

target <- "vs"
value <- 1

library(data.table)
dt <- as.data.table(head(mtcars))

So I'm trying to pass both column name and a value as variables into the j expression within data.table environment, something that would be equivalent to

dt[, vs == 1]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

If only the value is the variable, it works nicely

dt[, vs == value]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

We can also call the column within the data.table scope when it's a variable

dt[, target, with = FALSE]
#    vs
# 1:  0
# 2:  0
# 3:  1
# 4:  1
# 5:  0
# 6:  1

But I can't figure how to combine the two in a simple manner

Note: I'm well aware that I can simply do:

dt[[target]] == value
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

But I need it within the data table scope so I could modify other columns by reference, something like

dt[, NEWCOL := sum(vs == 1), by = am]

So here are my tries when both column name and the value are variables

dt[, target == value, with = FALSE]
# Null data.table (0 rows and 0 cols)
dt[, target == value]
# [1] FALSE
dt[, (target) == value]
# [1] FALSE
dt[, .(target == value)]
# V1
# 1: FALSE
dt[, eval(target) == value]
# [1] FALSE
dt[target %in% value]
## Empty data.table (0 rows) of 11 cols: mpg,cyl,disp,hp,drat,wt...

Eventually I came up with

dt[, .SD[[target]] == value]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

but it is very inefficient, here's a simple benchmark

set.seed(123)
n <- 1e6
dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))
system.time(dt[, NEWCOL := sum(.SD[[target]] == value), by = am])
#  user  system elapsed 
# 13.00    0.02   13.12 
system.time(dt[, NEWCOL2 := sum(vs == value), by = am])
# user  system elapsed 
# 0.82    0.00    0.83 

Question: Is there any better way of doing this that I'm missing here? Something either more idiomatic or much more efficient


Edit

Originally I was looking for something idiomatic, so I thought @GGrothendieck simple solution using get was the one, but surprisingly all @Richard version are beating even the version that ins't doing any evaluation of the column name

set.seed(123)
n <- 1e7
dt <- data.table(vs = sample(1L:30L, n, replace = TRUE), am = seq_len(n))

cl <- substitute(
  x == y, 
  list(x = as.name(target), y = value)
)
cl2 <- call("==", as.name(target), value)

system.time(dt[, NEWCOL := sum(vs == value), by = am])
#   user  system elapsed 
#   0.83    0.00    0.82 
system.time(dt[, NEWCOL1 := sum(.SD[[target]] == value), by = am])
#   user  system elapsed 
#   8.97    0.00    8.97 
system.time(dt[, NEWCOL2 := sum(get(target) == value), by = am])
#   user  system elapsed 
#   2.35    0.00    2.37 
system.time(dt[, NEWCOL3 := sum(eval(cl)), by = am])
#   user  system elapsed 
#   0.69    0.02    0.71 
system.time(dt[, NEWCOL4 := sum(eval(cl2)), by = am])
#   user  system elapsed 
#   0.76    0.00    0.77 
system.time(dt[, NEWCOL5 := sum(eval(as.name(target)) == value), by = am])
#   user  system elapsed 
#   0.78    0.00    0.78 
like image 799
David Arenburg Avatar asked Dec 28 '14 14:12

David Arenburg


1 Answers

Here is one possible alternative.

target <- "vs"
value <- 1
dt <- as.data.table(head(mtcars))

In terms of code it's not necessarily simpler, but we can set up an unevaluated call cl defined outside the scope of dt which is to be evaluated inside the data table's environment.

cl <- substitute(
    x == y, 
    list(x = as.name(target), y = value)
)

substitute() might be necessary for longer expressions. But in this case, call() would shorten the code and create the same cl result. And so cl could also be

cl <- call("==", as.name(target), value)

Now we can evaluate cl inside dt. On your example this seems to work fine.

dt[, NEWCOL := sum(eval(cl)), by = am][]
#     mpg cyl disp  hp drat    wt  qsec vs am gear carb NEWCOL
# 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4      1
# 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4      1
# 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      1
# 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      2
# 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      2
# 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1      2

After thinking about this for a minute, I'm not sure value needed to be substituted, and hence the following also works. But as David notes, the first approach is more time efficient.

dt[, eval(as.name(target)) == value]
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
like image 166
Rich Scriven Avatar answered Oct 22 '22 14:10

Rich Scriven