
Variables as default arguments of a function, using dplyr

Goal

My goal is to define some functions for use within dplyr verbs that rely on pre-defined variables. Several of my functions take a bunch of arguments, and many of those arguments are always the same variable names.

My understanding: this is difficult (and perhaps impossible) because dplyr lazily evaluates user-specified variables later on, but default arguments are not part of the function call and are therefore invisible to dplyr.

Toy example

Consider the following example, where I use dplyr to calculate whether a variable has changed or not (rather meaningless in this case):

library(dplyr)
mtcars  %>%
  mutate(cyl_change = cyl != lag(cyl))

Now, lag also supports alternate ordering like so:

mtcars  %>%
  mutate(cyl_change = cyl != lag(cyl, order_by = gear))

But what if I'd like to create my own version of lag that always orders by gear?

Failed attempts

The naive approach is this:

lag2 <- function(x, n = 1L, order_by = gear) lag(x, n = n, order_by = order_by)

mtcars %>%
  mutate(cyl_change = cyl != lag2(cyl))

But this obviously raises the error:

no object named ‘gear’ was found
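The lookup failure can be reproduced without dplyr. The following is a minimal base R sketch, assuming that no object named gear exists in the global environment; show_order is a hypothetical helper and df is a small made-up data frame. It shows that an argument supplied by the caller is looked up where the call is evaluated, while a default argument is looked up from inside the function, where the columns are not visible.

```r
# Made-up data for illustration only.
df <- data.frame(cyl = c(4, 6, 8), gear = c(3, 1, 2))

# Hypothetical helper: returns its `order_by` argument unchanged.
show_order <- function(x, order_by = gear) order_by

# Supplied by the caller: `gear` is looked up inside with(), so the column is found.
explicit <- with(df, show_order(cyl, order_by = gear))

# Default argument: `gear` is looked up from the function's own environment,
# which does not contain the columns of df, so the call fails.
msg <- tryCatch(
  with(df, show_order(cyl)),
  error = function(e) conditionMessage(e)
)
```

Here explicit holds the gear column, while msg holds the text of an "object not found" error.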

More realistic options would be these, but they also don't work:

lag2 <- function(x, n = 1L) lag(x, n = n, order_by = ~gear)
lag2 <- function(x, n = 1L) lag(x, n = n, order_by = get(gear))
lag2 <- function(x, n = 1L) lag(x, n = n, order_by = getAnywhere(gear))
lag2 <- function(x, n = 1L) lag(x, n = n, order_by = lazyeval::lazy(gear))

Question

Is there a way to get lag2 to correctly find gear within the data.frame that dplyr is operating on?

  • One should be able to call lag2 without having to provide gear.
  • One should be able to use lag2 on datasets that are not called mtcars (but do have gear as one of their variables).
  • Preferably gear would be a default argument to the function, so it can still be changed if required, but this is not crucial.
Axeman asked Mar 29 '16 14:03


1 Answer

Here are two approaches in data.table; however, I don't believe that either of them will work in dplyr at present.

In data.table, whatever is inside the j-expression (aka the 2nd argument of [.data.table) gets parsed by the data.table package first, and not by the regular R parser. In a way you can think of it as a separate language parser living inside the regular language parser that is R. This parser looks for the variables you have used that are actually columns of the data.table you're operating on, and puts whatever it finds in the environment of the j-expression.

This means that you have to let this parser know somehow that gear will be used, or it simply will not be part of the environment. Following are two ideas for accomplishing that.
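The column detection described here can be approximated with base R's all.vars(), which lists the variable names that appear in an expression. This is only an illustration of the idea, not data.table's actual implementation; lag2 is never called in this sketch, it only appears inside quoted expressions.

```r
# Only `cyl` appears in the expression itself, so `gear` would not be detected.
plain <- all.vars(quote(lag2(cyl)))

# Mentioning `gear` anywhere in the expression makes it detectable.
hinted <- all.vars(quote({gear; lag2(cyl)}))
```

A default argument hidden inside lag2 never shows up in the expression, which is why the column has to be mentioned explicitly.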

The "simple" way to do it, is to actually use the column name in the j-expression where you call lag2 (in addition to some monkeying within lag2):

library(data.table)
library(dplyr)  # provides lag() with the order_by argument

dt = as.data.table(mtcars)

lag2 = function(x) lag(x, order_by = get('gear', sys.frame(4)))

dt[, newvar := {gear; lag2(cyl)}]
# or
dt[, newvar := {.SD; lag2(cyl)}]

This solution has 2 undesirable properties imo. First, I'm not sure how fragile that sys.frame(4) is: put this in a function or a package and I don't know what will happen. You can probably work around it and figure out the right frame, but it's kind of a pain. Second, you either have to mention the particular variable you're interested in somewhere in the expression, or dump all of them into the environment by using .SD, again anywhere in the expression.
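As a rough illustration of looking up the caller's frame instead of hard-coding a frame number, here is a base R sketch that uses parent.frame() together with with(). The helpers ordered_lag and lag_by_gear are hypothetical, and the data frame is made up. Whether the same lookup reaches the columns inside a data.table j-expression or a dplyr verb depends on how those packages evaluate expressions and is not verified here.

```r
# Hypothetical base R stand-in for an ordered lag: shift x by n positions
# along the ordering given by order_by, then restore the original row order.
ordered_lag <- function(x, n = 1L, order_by) {
  ord <- order(order_by)
  shifted <- c(rep(NA, n), x[ord])[seq_along(x)]
  out <- x
  out[ord] <- shifted
  out
}

# Hypothetical wrapper: fetch `gear` from the environment of the caller.
lag_by_gear <- function(x, n = 1L) {
  order_by <- get("gear", envir = parent.frame())
  ordered_lag(x, n = n, order_by = order_by)
}

# Made-up data for illustration only.
df <- data.frame(cyl = c(10, 20, 30), gear = c(3, 1, 2))

# with() evaluates the call in an environment built from df,
# so parent.frame() inside lag_by_gear() can see the `gear` column.
res <- with(df, lag_by_gear(cyl))
```

This works in with() because the call to lag_by_gear is evaluated directly in the environment that holds the columns; any extra layer of function calls in between would change what parent.frame() returns.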

A second option, which I like more, is to take advantage of the fact that the data.table parser evaluates eval expressions in place before the variable lookup, so if you use a variable inside some expression that you eval, that would work:

lag3 = quote(function(x) lag(x, order_by = gear))

dt[, newvar := eval(lag3)(cyl)]

This doesn't suffer from the issues of the other solution, with the obvious disadvantage of having to type an extra eval.

eddi answered Sep 20 '22 12:09