
Refactor R code when library functions use non-standard evaluation

Tags: r, dplyr

I have some R code that looks like this:

library(dplyr)
library(datasets)

iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()

Giving:

Source: local data frame [6 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.3         3.0          1.1         0.1     setosa
2          4.6         3.6          1.0         0.2     setosa
3          5.0         2.3          3.3         1.0 versicolor
4          5.1         2.5          3.0         1.1 versicolor
5          4.9         2.5          4.5         1.7  virginica
6          6.0         3.0          4.8         1.8  virginica

This groups by species and, for each group, keeps only the two rows with the shortest Petal.Length. I have some duplication in my code, because I do this several times for different columns and different numbers of rows. E.g.:

iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(Petal.Width, ties.method = 'random')<=3) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Width, ties.method = 'random')<=3) %.% ungroup()

I want to extract this into a function. The naive approach doesn't work:

keep_min_n_by_species <- function(expr, n) {
  iris %.% group_by(Species) %.% filter(rank(expr, ties.method = 'random') <= n) %.% ungroup()
}

keep_min_n_by_species(Petal.Width, 2)

Error in filter_impl(.data, dots(...), environment()) : 
  object 'Petal.Width' not found 

As I understand it, the expression rank(Petal.Length, ties.method = 'random') <= 2 is evaluated in a different context, introduced by the filter function, that provides a meaning for the Petal.Length expression. I can't just swap in a variable for Petal.Length, because it will be evaluated in the wrong context. I've tried different combinations of substitute and eval, having read this page: Non-standard evaluation, but I can't figure out an appropriate combination.

I think the problem might be that I don't just want to pass an expression from the caller (Petal.Length) through to filter for it to evaluate. I want to construct a new, bigger expression (rank(Petal.Length, ties.method = 'random') <= 2) and then pass that whole expression to filter for it to evaluate.

  1. How can I refactor this expression into a function?
  2. More generally, how should I go about extracting an R expression into a function?
  3. Even more generally, am I approaching this with the wrong mindset? In more mainstream languages I'm familiar with (e.g. Python, C++, C#), this is a relatively straightforward operation that I want to do all the time to remove duplication in my code. In R it seems (to me, at least) that non-standard evaluation can make it a very non-obvious operation. Should I be doing something else entirely?
asked Sep 26 '14 11:09 by Weeble


2 Answers

dplyr version 0.3 is beginning to address this using the lazyeval package, as @baptiste mentioned, and a new family of functions that use standard evaluation (same function names as the NSE versions, but ending in _). There is a vignette here: https://github.com/hadley/dplyr/blob/master/vignettes/nse.Rmd

That said, I don't know the best practice for what you're trying to do (I'm trying to do the same thing myself). I have something working, though it may not be the best way to do it. Note the use of filter_() instead of filter(), and that the column name is passed in as a quoted character string:

devtools::install_github("hadley/dplyr")
devtools::install_github("hadley/lazyeval")

library(dplyr)
library(lazyeval)

keep_min_n_by_species <- function(expr, n, rev = FALSE) {
  iris %>% 
    group_by(Species) %>% 
    filter_(interp(~rank(if (rev) -x else x, ties.method = 'random') <= y, # filter_, not filter
                   x = as.name(expr), y = n)) %>% 
    ungroup()
}

keep_min_n_by_species("Petal.Width", 3) # "Petal.Width" as character string
keep_min_n_by_species("Petal.Width", 3, rev = TRUE)

Update based on @hadley's comment:

keep_min_n_by_species <- function(expr, n) {
  expr <- lazy(expr)

  formula <- interp(~rank(x, ties.method = 'random') <= y,
                    x = expr, y = n)

  iris %>% 
    group_by(Species) %>% 
    filter_(formula) %>% 
    ungroup()
}

keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)
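
For comparison, here is a rough base R sketch of the same idea: capture the caller's expression, splice it into a larger expression, and evaluate that against each group. It is not the dplyr/lazyeval implementation. keep_min_n_base is a hypothetical name, and it uses ties.method = "first" instead of 'random' (an assumption made only so the result is reproducible).

```r
# Hypothetical helper, base R only; not part of dplyr or lazyeval.
keep_min_n_base <- function(data, expr, n) {
  expr <- substitute(expr)   # the caller's expression, left unevaluated
  caller <- parent.frame()   # where any non-column names should be looked up
  # Splice the captured expression and the value of n into a bigger expression.
  cond <- bquote(rank(.(expr), ties.method = "first") <= .(n))
  # Evaluate the condition once per species, with that group's columns in scope.
  groups <- split(data, data$Species)
  kept <- lapply(groups, function(g) g[eval(cond, g, caller), , drop = FALSE])
  do.call(rbind, kept)
}

res <- keep_min_n_base(iris, -Petal.Width, 3)
table(res$Species)
```

The point is only that the expression is captured before it is evaluated, then evaluated with the data frame's columns in scope, which is roughly what lazy() and filter_() take care of in the code above.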
answered Nov 10 '22 04:11 by andyteucher


How about:

keep_min_n_by_species <- function(expr, n) {
    mc <- match.call()
    fx <- bquote(rank(.(mc$expr), ties.method = 'random') <= .(mc$n))
    iris %.% group_by(Species) %.% filter(fx) %.% ungroup()
}

That seems to allow all of these statements to run without error:

keep_min_n_by_species(Petal.Width, 2)
keep_min_n_by_species(-Petal.Width, 2)
keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)

The idea is that we use match.call() to capture the unevaluated expressions passed to the function. Then we use bquote() to build the filter as a call object.
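
To see what those two steps produce, here is a small base R sketch that builds the call and evaluates it by hand against a single species. build_rank_filter is a hypothetical helper used only for illustration, and ties.method = "first" replaces 'random' (an assumption, to keep the result deterministic).

```r
# Hypothetical helper for illustration; it only builds the call object.
build_rank_filter <- function(expr, n) {
  mc <- match.call()   # the call as written, with arguments unevaluated
  bquote(rank(.(mc$expr), ties.method = "first") <= .(mc$n))
}

fx <- build_rank_filter(-Petal.Width, 2)
print(fx)       # shows the expression that was built
is.call(fx)     # it is a call object, not a logical vector

# Evaluate the call with the columns of one group in scope.
setosa <- iris[iris$Species == "setosa", ]
keep <- eval(fx, setosa)
setosa[keep, ]
```

Nothing is evaluated until eval() is given a data frame to look the column names up in, which is why the expression can be assembled inside a function first.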

answered Nov 10 '22 03:11 by MrFlick