
Is there an equivalent of dplyr data pronouns in data.table?

Tags: r, data.table

Is there a way to tell data.table to look for an external variable instead of a column name, as you can with the .env pronoun in dplyr? Suppose a data frame has a column and there is also an external variable with the same name: how do you distinguish between them? Have a look at the following example:

animalDf <- data.frame(
  animal = c("snail", "spider", "bear"),
  legs = c(0, 8, 4)
)
animal <- "spider"
animalDf |> 
  dplyr::filter(.data$animal == .env$animal) |> 
  dplyr::pull(legs)
# I get the correct result: 8
animalDt <- data.table::as.data.table(animalDf)
animalDt[animal == animal, legs] # does not work: both sides refer to the column, so every row is returned

Inside functions I might not be able to control all the column names of the data.table, so it is important to be able to state explicitly that the environment variable should be used.

Noskario asked Jan 02 '26 00:01


1 Answer

Using env

We can use the env argument to substitute an external value into the subsetting expression. This was introduced in data.table v1.15.0 (Jan 2024) and is described in the Programming on data.table vignette.

Note that because in this case we want to provide the actual character value, i.e. "spider", rather than a column called spider, we wrap it in the I() function. As the docs note:

The I function marks an object as AsIs, preventing its arguments from character-to-symbol automatic conversion.

This is only required when the value being passed is a character string. See this similar question with a numeric column, where I() is not required.

animalDt[
    animal == animal_var,
    legs,
    env = list(animal_var = I(animal))
]
# [1] 8
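For comparison, here is a minimal sketch of the numeric case mentioned above, where the value is passed to env without I(). The external variable legs is an assumption added for illustration; it is deliberately given the same name as the column to mirror the clash in the question. This assumes data.table v1.15.0 or later.

```r
library(data.table)

animalDt <- data.table(
    animal = c("snail", "spider", "bear"),
    legs = c(0, 8, 4)
)

# Hypothetical external variable with the same name as the `legs` column
legs <- 8

# Numeric values are substituted as they are, so no I() wrapper is used here.
# The env list is built in the calling environment, where `legs` is the
# external variable rather than the column.
result <- animalDt[
    legs == legs_var,
    animal,
    env = list(legs_var = legs)
]
result
```

The placeholder legs_var only exists inside the expression. It is replaced by the value of the external variable before the subset is evaluated.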

Alternative approach: the .. prefix

Alternatively, in this instance, you can use the .. prefix to refer to objects in the parent environment. As the data.table vignettes note:

For those familiar with the Unix terminal, the .. prefix should be reminiscent of the “up-one-level” command, which is analogous to what’s happening here – the .. signals to data.table to look for the select_cols variable “up-one-level”, i.e., within the global environment in this case.

animalDt[, legs[animal == ..animal]]
# [1] 8

I think the vignette is actually a little conservative as .. can access variables which are more than one level up if necessary, otherwise the following would not work:

f <- function(dt) {
    g <- function(dt) dt[, legs[animal == ..animal]]
    g(dt)
}

f(animalDt)
# [1] 8

This is not a good way to write a function (animal should be a parameter), but under the hood .. is doing get0("animal", parent.frame()). This means it will be able to access animal if it exists in environments enclosing the parent frame, such as the global environment.

However, note that we are subsetting the legs column, i.e. making a copy at certain indices, which with a very large data.table could be slow.

This is because we can only use .. in j but not in i, i.e. this does not work:

animalDt[animal == ..animal, legs]
# Error in eval(stub[[3L]], x, enclos) : object '..animal' not found

Personally, I find .. more readable, and if performance is not a large concern I would use it. However, for a more generalisable and performant approach, env is the way to go.
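To connect this back to the function use case in the question, here is a minimal sketch of env inside a function whose argument shares its name with a column. The helper name legs_for is hypothetical and only for illustration. This assumes data.table v1.15.0 or later.

```r
library(data.table)

animalDt <- data.table(
    animal = c("snail", "spider", "bear"),
    legs = c(0, 8, 4)
)

# Hypothetical helper: the argument is deliberately named like the column
legs_for <- function(dt, animal) {
    dt[
        animal == animal_var,
        legs,
        # The env list is built inside this function, so `animal` here is
        # the function argument; I() keeps it as a character value.
        env = list(animal_var = I(animal))
    ]
}

legs_for(animalDt, "spider")
legs_for(animalDt, "bear")
```

Because the value is passed in explicitly through env, the function does not depend on which column names the data.table happens to have.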

Other approaches are retired

It is also possible to use get(), mget() or eval() here instead (as is done in the accepted answer to the similar question), but as Friede states in the comments, these approaches have now been retired in data.table in favour of env.

SamR answered Jan 03 '26 20:01


