
get(x) does not work in R data.table when x is also a column in the data table

Tags: r, data.table

I noticed that get(x) does not work inside a data.table when x is also the name of a column in that table. See the code snippet below. This is hard to avoid completely when writing an R function that takes a data.table as input. Is this a bug in the data.table package? Thanks!

library(data.table)

dt = data.table(x=1:3, y=2:4)

var = 'y'
x = 'y'

dt[, 3*get(var)]      # [1] 6 9 12
dt[, 3*get(x)]        # Error in get(x): invalid first argument

user9210742 asked Jan 12 '18 21:01


2 Answers

In general, when there is a naming conflict between columns and variables, columns take precedence. Since v1.10.2 (31 Jan 2017) of data.table, the preferred way to indicate that a name is not a column name is the .. prefix [1]:

When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers. When you see the .. prefix think one-level-up, like the directory .. in all operating systems means the parent directory. In future the .. prefix could be made to work on all symbols appearing anywhere inside DT[...]. ...

Our main focus here which we believe .. achieves is to resolve the more common ambiguity when var is in calling scope and var is a column name too. Further, we have not forgotten that in the past we recommended prefixing the variable in calling scope with .. yourself. If you did that and ..var exists in calling scope, that still works, provided neither var exists in calling scope nor ..var exists as a column name. Please now remove the .. prefix on ..var in calling scope to tidy this up. In future data.table will start to warn/error on such usage.
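As a minimal sketch of the column-selection case the release note describes, the snippet below uses a variable x that shares its name with a column. The object names (dt, x, sel) are chosen only for illustration, and it assumes data.table v1.10.2 or later is installed.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 2:4)
x <- "y"           # a variable in calling scope that shares its name with a column

# ..x is looked up in calling scope, so its value "y" is used as a column name
sel <- dt[, ..x]
names(sel)         # "y"
```

Without the prefix, dt[, x] would return the column x itself rather than the column named by the variable.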

In your case, you can use get(..x) to force the name x to be resolved in calling scope rather than within the data.table environment:

library(data.table)

dt = data.table(x=1:3, y=2:4)

var = 'y'
x = 'y'

dt[, 3*get(var)]      # [1] 6 9 12
dt[, 3*get(x)]        # Error in get(x): invalid first argument
dt[, 3*get(..x)]      # [1]  6  9 12

The .. prefix is still somewhat experimental and thus has limited documentation, but it is mentioned briefly on the help page for data.table:

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. In case of overlapping variable names inside the dataset and in parent scope you can use the double dot prefix ..cols to explicitly refer to the cols variable in parent scope and not from your dataset.

This is less a bug than an unfortunate but natural consequence of with = TRUE, which allows columns to be used as variables in a data environment. You could also avoid this issue in a more base R way by using the pos or envir argument of get().
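One possible reading of the envir suggestion can be sketched in base R alone, using with() on a data.frame as a stand-in for the data.table environment (an analogy, not data.table itself). Note that passing envir to the outer get() would not help on its own, because its first argument x is still evaluated as the column. The sketch instead uses an inner get("x", envir = ...) to fetch the outer variable first. The names df, outer_env, res_err and res_ok are illustrative.

```r
df <- data.frame(x = 1:3, y = 2:4)
x <- "y"

# Inside with(), `x` resolves to the column 1:3, so get() is handed
# an integer vector instead of a name and signals an error.
res_err <- tryCatch(with(df, 3 * get(x)),
                    error = function(e) conditionMessage(e))

# Look up the outer `x` explicitly by environment, then fetch the
# column that it names from the data environment.
outer_env <- environment()
res_ok <- with(df, 3 * get(get("x", envir = outer_env)))
res_ok   # 6 9 12
```

Inside a function, outer_env would be the function's own environment, which is where the argument holding the column name lives.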

Bob answered Sep 20 '22 03:09


New Answer

Based on advice from @Frank and this section of the vignette I can't believe I hadn't read before, here's a solution to this problem that doesn't allow arbitrary code to be executed.

library(data.table)
dt = data.table(x=1:3, y=2:4)

x = "y"
ExecuteMeLater = substitute(3*x, list(x=as.symbol(x)))
dt[, eval(ExecuteMeLater)]

# [1]  6  9 12

This behavior in particular is why I prefer this solution:

x = "(system(paste0('kill ',Sys.getpid())))"
ExecuteMeLater = substitute(3*x, list(x=as.symbol(x)))
dt[, eval(ExecuteMeLater)]

#Error in eval(jsub, SDenv, parent.frame()) : 
#  object '(system(paste0('kill ',Sys.getpid())))' not found
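The reason the string is never executed can be checked in base R without data.table. The sketch below uses a harmless stand-in string, "sum(1, 2)", in place of the hostile one. All variable names are illustrative.

```r
col_name <- "sum(1, 2)"   # harmless stand-in for a hostile string

as_sym  <- as.symbol(col_name)           # one name; the text is never parsed
as_code <- parse(text = col_name)[[1]]   # parsed into a function call

is.name(as_sym)    # TRUE
is.call(as_code)   # TRUE

# Evaluating the symbol only triggers a lookup for an object with that
# odd name, which fails; evaluating the parsed call runs the code.
sym_err  <- tryCatch(eval(as_sym), error = function(e) conditionMessage(e))
code_val <- eval(as_code)                # 3

# substitute() splices the symbol into the expression before evaluation
expr <- substitute(3 * x, list(x = as.symbol("y")))
deparse(expr)                            # "3 * y"
result <- eval(expr, list(y = 2:4))      # 6 9 12
```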

Original Answer

Note: came across what looks like a really useful resource for questions of this nature... might be able to update with a less hacky solution at some point.

The get() behavior certainly leaves the door open for unexpected outcomes, and it appears this has been brought up in more than a few GitHub issues in the past. To be honest, I've done a decent amount of investigation, but I'm still not sure what the proper usage would be.

One workaround is to paste the expression together as a character string outside of the data.table environment, where x still refers to the variable holding the column name passed to your function.

Then, by parsing and evaluating the pre-constructed expression inside the data.table environment, we avoid any opportunity for a column named x within the table to take precedence over the contents of the variable x.

library(data.table)

dt = data.table(x=1:3, y=2:4)

x = 'y'
ExecuteMeLater <- paste0("3*",x)  ## "3*y"
dt[, eval(parse(text = ExecuteMeLater))]

Output:

[1]  6  9 12

Not the prettiest solution, but it's worked for me numerous times in the past.
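When the expression is as simple as a single column lookup, a sketch that avoids evaluating anything inside j is list-style extraction with [[. The helper triple_column below is hypothetical, written only to illustrate the function-input case from the question.

```r
library(data.table)

# Hypothetical helper: `col` holds one column name as a string
triple_column <- function(dt, col) {
  stopifnot(is.character(col), length(col) == 1L, col %in% names(dt))
  3 * dt[[col]]   # a data.table is a list of columns; no j scoping involved
}

dt <- data.table(x = 1:3, y = 2:4)
x <- "y"
triple_column(dt, x)   # 6 9 12
```

This only covers plain column access. Anything that needs grouping or other j features still calls for one of the approaches above.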

Quick disclaimer on hypothetical doomsday scenarios possible with eval(parse(...))

There are far more in-depth discussions of the dangers of eval(parse(...)), but I'll avoid repeating them in full.

Theoretically you could have issues if one of your columns is named something unfortunate like "(system(paste0('kill ',Sys.getpid())))" (do not execute that; it will kill your R session on the spot). This is probably unlikely enough not to lose sleep over, unless you plan on putting this in a package on CRAN.

Matt Summersgill answered Sep 20 '22 03:09