
Using data.table and tidy eval together: why group by does not work as expected, why is ~ inserted?

I do not have a pressing use case but would like to understand how tidy eval and data.table may work together.

I have working alternative solutions so I am mostly interested in the why because I hope to have a better understanding of tidy eval in general which would help me in a wide variety of use cases.

How to make data.table + tidy eval work with group by?

In the following examples I used the development version of rlang.

update

I updated my original question based on Stefan F's answer and my further explorations. I no longer think the inserted ~ is a significant part of the question, as it is present in the dplyr code as well. However, there is one specific combination, data.table + group by + quo, that does not work, and I do not understand why.

# setup ------------------------------------

suppressPackageStartupMessages(library("data.table"))
suppressPackageStartupMessages(library("rlang"))
suppressPackageStartupMessages(library("dplyr"))
#> Warning: package 'dplyr' was built under R version 3.5.1

dt <- data.table(
    num_campaign = 1:5,
    id = c(1, 1, 2, 2, 2)
)
df <- as.data.frame(dt)

# original question ------------------------

aggr_expr <- quo(sum(num_campaign))

q <- quo(dt[, aggr := !!aggr_expr][])

e <- quo_get_expr(q)
e
#> dt[, `:=`(aggr, ~sum(num_campaign))][]
dt[, `:=`(aggr, ~sum(num_campaign))][]
#> Error in `[.data.table`(dt, , `:=`(aggr, ~sum(num_campaign))): RHS of assignment is not NULL, not an an atomic vector (see ?is.atomic) and not a list column.
eval_tidy(e, data = dt)
#>    num_campaign id aggr
#> 1:            1  1   15
#> 2:            2  1   15
#> 3:            3  2   15
#> 4:            4  2   15
#> 5:            5  2   15

Using an expression instead of a quosure is not a good option in this case, because variables in the user-supplied expression might not be evaluated in the intended environment:

# updated question --------------------------------------------------------

aggr_dt_expr <- function(dt, aggr_rule) {
    aggr_expr <- enexpr(aggr_rule)
    x <- 2L
    q <- quo(dt[, aggr := !!aggr_expr][])
    eval_tidy(q, data = dt)
}

x <- 1L
# expression is evaluated with x = 2
aggr_dt_expr(dt, sum(num_campaign) + x)
#>    num_campaign id aggr
#> 1:            1  1   17
#> 2:            2  1   17
#> 3:            3  2   17
#> 4:            4  2   17
#> 5:            5  2   17

aggr_dt_quo <- function(dt, aggr_rule) {
    aggr_quo <- enquo(aggr_rule)
    x <- 2L
    q <- quo(dt[, aggr := !!aggr_quo][])
    eval_tidy(q, data = dt)
}

x <- 1L
# expression is evaluated with x = 1
aggr_dt_quo(dt, sum(num_campaign) + x)
#>    num_campaign id aggr
#> 1:            1  1   16
#> 2:            2  1   16
#> 3:            3  2   16
#> 4:            4  2   16
#> 5:            5  2   16

I have an explicit problem using group by:

# using group by --------------------------------

grouped_aggr_dt_expr <- function(dt, aggr_rule) {
    aggr_quo <- enexpr(aggr_rule)
    x <- 2L
    q <- quo(dt[, aggr := !!aggr_quo, by = id][])
    eval_tidy(q, data = dt)
}

# group by has effect but x = 2 is used
grouped_aggr_dt_expr(dt, sum(num_campaign) + x)
#>    num_campaign id aggr
#> 1:            1  1    5
#> 2:            2  1    5
#> 3:            3  2   14
#> 4:            4  2   14
#> 5:            5  2   14

grouped_aggr_dt_quo <- function(dt, aggr_rule) {
    aggr_quo <- enquo(aggr_rule)
    x <- 2L
    q <- quo(dt[, aggr := !!aggr_quo, by = id][])
    eval_tidy(q, data = dt)
}

# group by has no effect
grouped_aggr_dt_quo(dt, sum(num_campaign) + x)
#>    num_campaign id aggr
#> 1:            1  1   16
#> 2:            2  1   16
#> 3:            3  2   16
#> 4:            4  2   16
#> 5:            5  2   16


# using dplyr works fine ------------------------------------------------------------

grouped_aggr_df_quo <- function(df, aggr_rule) {
    aggr_quo <- enquo(aggr_rule)
    x <- 2L
    q <- quo(mutate(group_by(df, id), !!aggr_quo))
    eval_tidy(q)
}
grouped_aggr_df_quo(df, sum(num_campaign) + x)
#> # A tibble: 5 x 3
#> # Groups:   id [2]
#>   num_campaign    id `sum(num_campaign) + x`
#>          <int> <dbl>                   <int>
#> 1            1     1                       4
#> 2            2     1                       4
#> 3            3     2                      13
#> 4            4     2                      13
#> 5            5     2                      13

I understand that extracting expressions from quosures is not the intended way to work with tidy eval, but I hoped to use it as a debugging tool (not much luck so far):

# returning expression in quo for debugging --------------

grouped_aggr_dt_quo_debug <- function(dt, aggr_rule) {
    aggr_quo <- enquo(aggr_rule)
    x <- 2L
    q <- quo(dt[, aggr := !!aggr_quo, by = id][])
    quo_get_expr(q)
}

grouped_aggr_dt_quo_debug(dt, sum(num_campaign) + x)
#> dt[, `:=`(aggr, ~sum(num_campaign) + x), by = id][]

grouped_aggr_df_quo_debug <- function(df, aggr_rule) {
    aggr_quo <- enquo(aggr_rule)
    x <- 2L
    q <- quo(mutate(group_by(df, id), !!aggr_quo))
    quo_get_expr(q)
}
# ~ is inserted in this case as well so it is not the problem
grouped_aggr_df_quo_debug(df, sum(num_campaign) + x)
#> mutate(group_by(df, id), ~sum(num_campaign) + x)

Created on 2018-08-12 by the reprex package (v0.2.0).

Original wording of the question:

Why is a ~ inserted and why isn't it a problem with tidy eval if it is a problem with base eval and everything is in the global environment?

This example is derived from a more realistic but also more complicated use case where I got unexpected results.

Ildi Czeller asked Aug 11 '18 19:08



2 Answers

TLDR: Quosures are implemented as formulas because of a bug that affects all versions of R prior to 3.5.1. The special rlang definition for ~ is only available within eval_tidy(). This is why quosures are not as compatible with non-tidyeval functions as we'd like.

Edit: That said, there are probably other challenges in making data masking APIs like data.table compatible with quosures.


Quosures are currently implemented as formulas:

library("rlang")

q <- quo(cat("eval!\n"))

is.call(q)
#> [1] TRUE

as.list(unclass(q))
#> [[1]]
#> `~`
#>
#> [[2]]
#> cat("eval!\n")
#>
#> attr(,".Environment")
#> <environment: R_GlobalEnv>

Compare to ordinary formulas:

f <- ~cat("eval?\n")

is.call(f)
#> [1] TRUE

as.list(unclass(f))
#> [[1]]
#> `~`
#>
#> [[2]]
#> cat("eval?\n")
#>
#> attr(,".Environment")
#> <environment: R_GlobalEnv>

So what's the difference between a quosure and a formula? The former evaluates itself while the latter quotes itself, i.e. it returns itself.

eval_tidy(q)
#> eval!

eval_tidy(f)
#> ~cat("eval?\n")

The self-quoting mechanism is implemented by the ~ primitive:

`~`
#> .Primitive("~")

One important task of this primitive is to record an environment the very first time a formula is evaluated. For instance the formula in quote(~foo) is not evaluated and does not record an environment while eval(quote(~foo)) does.
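The environment-recording behaviour described above can be checked with base R alone. This is a minimal sketch; the variable names `unevaluated` and `evaluated` are only illustrative.

```r
# A quoted `~` call is a plain call: no class, no recorded environment.
unevaluated <- quote(~foo)
class(unevaluated)
#> [1] "call"
is.null(attr(unevaluated, ".Environment"))
#> [1] TRUE

# Evaluating the call runs the `~` primitive, which returns a formula
# and records the environment in which the evaluation happened.
evaluated <- eval(unevaluated)
inherits(evaluated, "formula")
#> [1] TRUE
is.environment(attr(evaluated, ".Environment"))
#> [1] TRUE
```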

Anyway, when you evaluate a ~ call, the definition for ~ is looked up in the ordinary way and usually finds the ~ primitive. Just like when you compute 1 + 1, the definition for + is looked up and usually .Primitive("+") is found. The reason quosures self-evaluate instead of self-quote is simply that eval_tidy() creates a special definition for ~ in its evaluation environment. You can get hold of this special definition with eval_tidy(quote(`~`)).

So why did we implement quosures as formulas?

  1. It deparses and prints better. This reason is now outdated because we have our own expression deparser where quosures are printed with a leading ^ rather than a leading ~.

  2. Because of a bug in all versions of R prior to 3.5.1, expressions with a class are evaluated on recursive prints. Here is an example of a classed call:

    x <- quote(stop("oh no!"))
    x <- structure(x, class = "some_class")
    

    The object itself prints fine:

    x
    #> stop("oh no!")
    #> attr(,"class")
    #> [1] "some_class"
    

    But if you put it in a list it gets evaluated!

    list(x)
    #> [[1]]
    #> Error in print(stop("oh no!")) : oh no!
    

The eager evaluation bug does not affect formulas because they self-quote. Implementing quosures as formulas protected us from this bug.

Ideally we'd inline a function directly in the quosure, i.e. the first element would contain a function rather than the symbol ~. Here is how you can create a call with an inlined function:

c <- as.call(list(toupper, "a"))
c
#> (function (x)
#> {
#>     if (!is.character(x))
#>         x <- as.character(x)
#>     .Internal(toupper(x))
#> })("a")

The biggest advantage of inlining functions in calls is that they can be evaluated anywhere. Even in the empty environment!

eval(c, emptyenv())
#> [1] "A"

If we implemented quosures with inlined functions, they could similarly be evaluated anywhere. eval(q) would work, you could unquote quosures inside data.table calls, etc. But did you notice how noisy the inlined call prints because of the inlining? To work around this we'd have to give the call a class and a print method. But remember the R <= 3.5.0 bug. We'd get weird eager evaluations when printing lists of quosures at the console. This is why quosures are still implemented as formulas to this day and are not as compatible with non-tidyeval functions as we'd like.

Lionel Henry answered Oct 17 '22 14:10



You need to use expr() instead of quo()

expr() captures an expression; quo() captures the expression plus the environment in which it should be evaluated (a "quosure").
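As a rough illustration of that difference, the sketch below captures the same code once with expr() and once with quo() inside a function. The helper `make_both()` and the values of `x` are made up for this example.

```r
library(rlang)

make_both <- function() {
  x <- 2L  # local variable, visible only through the quosure's environment
  list(e = expr(x + 1L), q = quo(x + 1L))
}
both <- make_both()

# Only the quosure carries an environment along with the code.
is_quosure(both$e)
#> [1] FALSE
is_quosure(both$q)
#> [1] TRUE

# The bare expression uses whatever `x` the caller supplies ...
eval(both$e, list(x = 10L))
#> [1] 11

# ... while the quosure finds the `x` from where it was created.
eval_tidy(both$q)
#> [1] 3
```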

Quosures are an rlang/tidyeval-specific concept, so you need to use tidyeval to evaluate them.

As to ~: A tilde is used for formulas in R. Formulas are special R objects that were designed to specify models in R (such as in lm()), but they have some interesting properties that make them useful for other purposes as well. Apparently rlang uses them for representing quosures (but I don't know too much about the internals here).

base::eval() thinks you're supplying a formula and doesn't know what to do with it in that context, while eval_tidy() knows that you are actually passing a quosure. You don't have that problem with rlang::expr(), because it returns objects that base R also knows how to handle.
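A small sketch of that contrast, assuming a reasonably recent rlang: base eval() hands the quosure back unevaluated, much as it would a formula, while eval_tidy() and a plain expression both produce the value. The variable names are illustrative only.

```r
library(rlang)

q <- quo(1 + 1)

# base::eval() finds the ordinary `~` primitive, so the quosure quotes itself.
from_base <- eval(q)
is_quosure(from_base)
#> [1] TRUE

# eval_tidy() supplies its own definition of `~` and evaluates the code.
eval_tidy(q)
#> [1] 2

# A plain expression (here extracted from the quosure) is fine for base R.
eval(quo_get_expr(q))
#> [1] 2
```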

Stefan F answered Oct 17 '22 15:10
