
standard evaluation in dplyr: summarise a variable given as a character string

Tags:

r

dplyr

UPDATE July 2020:

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.

    library(dplyr)

    key <- "v3"
    val <- "v2"
    drp <- "v1"

    df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

    df %>%
        select(-matches(drp)) %>%
        group_by(.data[[key]]) %>%
        summarise(total = sum(.data[[val]], na.rm = TRUE))

    #> `summarise()` ungrouping output (override with `.groups` argument)
    #> # A tibble: 2 x 2
    #>   v3    total
    #>   <chr> <int>
    #> 1 A        21
    #> 2 B        19

If your code is in a package function, you can add the roxygen tag @importFrom rlang .data to avoid R CMD check notes about undefined global variables.
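
As a rough sketch of how this could look inside a package function: the helper name sum_by and its argument names are hypothetical (they are not part of dplyr or of the original post), and the roxygen tag only takes effect when the file is built as part of a package.

```r
library(dplyr)

# Hypothetical helper, for illustration only: sums one column per group,
# with both column names supplied as character strings.
#' @importFrom rlang .data
sum_by <- function(data, group_col, value_col) {
  data %>%
    group_by(.data[[group_col]]) %>%
    summarise(total = sum(.data[[value_col]], na.rm = TRUE), .groups = "drop")
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

# Column names are passed as plain strings.
result <- sum_by(df, group_col = "v3", value_col = "v2")
print(result)
```

Setting .groups = "drop" (available from dplyr 1.0.0) returns an ungrouped tibble and suppresses the informational message about ungrouping.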

ORIGINAL QUESTION:

I want to refer to an unknown column name inside a summarise. The standard evaluation functions introduced in dplyr 0.3 allow column names to be referenced using variables, but this doesn't appear to work when you call a base R function within e.g. a summarise.

    library(dplyr)

    key <- "v3"
    val <- "v2"
    drp <- "v1"

    df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

The df looks like this:

    > df
    Source: local data frame [5 x 3]

      v1 v2 v3
    1  1  6  A
    2  2  7  A
    3  3  8  A
    4  4  9  B
    5  5 10  B

I want to drop v1, group by v3, and sum v2 for each group:

    df %>%
      select(-matches(drp)) %>%
      group_by_(key) %>%
      summarise_(sum(val, na.rm = TRUE))

    Error in sum(val, na.rm = TRUE) : invalid 'type' (character) of argument

The NSE version of select() works fine, since it can match a character string. The SE version of group_by() works fine, since it can now accept variables as arguments and evaluate them. However, I haven't found a way to achieve similar results when using base R functions inside dplyr functions.

Things that don't work:

    df %>%
      group_by_(key) %>%
      summarise_(sum(get(val), na.rm = TRUE))
    Error in get(val) : object 'v2' not found

    df %>%
      group_by_(key) %>%
      summarise_(sum(eval(as.symbol(val)), na.rm = TRUE))
    Error in eval(expr, envir, enclos) : object 'v2' not found

I've checked out several related questions, but none of the proposed solutions have worked for me so far.

Ajar asked Nov 03 '14 22:11


1 Answer

Please note that this answer does not apply to dplyr >= 0.7.0, but to previous versions.

[dplyr 0.7.0] has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in vignette("programming").
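
For readers on dplyr 0.7.0 or later, a minimal sketch of the tidyeval style might look like the following. It assumes the same df, key, and val as in the question; sym() from rlang turns a string into a symbol, and !! unquotes it inside the dplyr verb. This is one possible spelling, not necessarily the one the vignette recommends for every case.

```r
library(dplyr)
library(rlang)

key <- "v3"
val <- "v2"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

# sym() converts the character string to a symbol; !! splices that symbol
# into the expression so dplyr looks the column up in the data.
result <- df %>%
  group_by(!!sym(key)) %>%
  summarise(total = sum(!!sym(val), na.rm = TRUE))

print(result)
```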


The dplyr vignette on non-standard evaluation is helpful here. The section "Mixing constants and variables" shows that the function interp from the lazyeval package can be used, and advises: "[u]se as.name if you have a character string that gives a variable name":

    library(lazyeval)

    df %>%
      select(-matches(drp)) %>%
      group_by_(key) %>%
      summarise_(sum_val = interp(~sum(var, na.rm = TRUE), var = as.name(val)))
    #   v3 sum_val
    # 1  A      21
    # 2  B      19
Henrik answered Sep 20 '22 18:09