I'm trying to write a function that takes a data frame and the name of a column to summarize using dplyr, then returns the summarized data frame. I've tried a bunch of permutations of interp() from the lazyeval package, but I've spent way too much time trying to get it to work. So I wrote a "static" version of the function I want here:
library(dplyr)

summarize.df.static <- function(){
  temp_df <- mtcars %>%
    group_by(cyl) %>%
    summarize(qsec = mean(qsec),
              mpg = mean(mpg))
  return(temp_df)
}
new_df <- summarize.df.static()
head(new_df)
Here is the start of the dynamic version I'm stuck on:
summarize.df.dynamic <- function(df_in, sum_metric_in){
  temp_df <- df_in %>%
    group_by(cyl) %>%
    summarize_(qsec = mean(qsec),
               sum_metric_in = mean(sum_metric_in)) # some mix of interp()
  return(temp_df)
}
new_df <- summarize.df.dynamic(mtcars, "mpg")
head(new_df)
Note that I want the name of the output column in this example to come from the parameter passed in as well (mpg in this case). Also note that the qsec column is static, i.e. not passed in.
Below is the correct answer posted by "docendo discimus":
library(dplyr)
library(lazyeval)

summarize.df.dynamic <- function(df_in, sum_metric_in){
  temp_df <- df_in %>%
    group_by(cyl) %>%
    summarize_(qsec = ~mean(qsec),
               # interp() substitutes the column symbol into the formula
               xyz = interp(~mean(var), var = as.name(sum_metric_in)))
  # rename the placeholder column to the requested column name
  names(temp_df)[names(temp_df) == "xyz"] <- sum_metric_in
  return(temp_df)
}
new_df <- summarize.df.dynamic(mtcars, "mpg")
head(new_df)
#  cyl     qsec      mpg
#1   4 19.13727 26.66364
#2   6 17.97714 19.74286
#3   8 16.77214 15.10000
new_df <- summarize.df.dynamic(mtcars, "disp")
head(new_df)
#  cyl     qsec     disp
#1   4 19.13727 105.1364
#2   6 17.97714 183.3143
#3   8 16.77214 353.1000
For the specific example (with static "qsec" etc) you could do:
library(dplyr)
library(lazyeval)

summarize.df <- function(data, sum_metric_in){
  data <- data %>%
    group_by(cyl) %>%
    summarize_(qsec = ~mean(qsec),
               xyz = interp(~mean(var), var = as.name(sum_metric_in)))
  names(data)[names(data) == "xyz"] <- sum_metric_in
  data
}
summarize.df(mtcars, "mpg")
#Source: local data frame [3 x 3]
#
#  cyl     qsec      mpg
#1   4 19.13727 26.66364
#2   6 17.97714 19.74286
#3   8 16.77214 15.10000
AFAIK you cannot (yet?) supply the input "sum_metric_in" to dplyr::rename, which you would typically use to rename the column. That is why I renamed it differently in the example, by assigning to names() directly.
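For readers on a newer dplyr: the underscore verbs such as summarize_ were deprecated in dplyr 0.7 in favour of tidy evaluation, which can set the output column name directly, so no rename step is needed. The following is only a sketch under that assumption (dplyr 0.7 or later installed); the function name summarize_df_tidyeval is hypothetical and not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: `sum_metric_in` is a column name given as a string.
summarize_df_tidyeval <- function(df_in, sum_metric_in) {
  df_in %>%
    group_by(cyl) %>%
    summarize(
      qsec = mean(qsec),
      # `!!name :=` sets the output column name from the string;
      # `.data[[name]]` looks the input column up by that same string
      !!sum_metric_in := mean(.data[[sum_metric_in]])
    )
}

new_df <- summarize_df_tidyeval(mtcars, "mpg")
head(new_df)
```

Here the `.data` pronoun and `:=` replace both interp() and the manual names() assignment; the static qsec column is written as in the non-dynamic version.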
You could use paste or ~ to get a quoted input that summarize_ understands.
df_in %>%
  group_by(cyl) %>%
  summarize_(qsec = ~mean(qsec),
             sum_metric_in = paste0('mean(', sum_metric_in, ')'))
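Note that in the fragment above the result column is literally named sum_metric_in, because the name on the left of = is not evaluated. One way to get the requested name as well is to pass a named list through the .dots argument of summarize_. This is a sketch, assuming a dplyr version that still ships the (now deprecated) underscore verbs; the function name summarize_df_string is hypothetical.

```r
library(dplyr)

# Hypothetical helper: builds the summary expression as a string.
summarize_df_string <- function(df_in, sum_metric_in) {
  # Name the list elements so the output columns get those names,
  # e.g. c("qsec", "mpg") when sum_metric_in is "mpg"
  dots <- setNames(
    list(~mean(qsec), paste0("mean(", sum_metric_in, ")")),
    c("qsec", sum_metric_in)
  )
  df_in %>%
    group_by(cyl) %>%
    summarize_(.dots = dots)
}

new_df <- summarize_df_string(mtcars, "disp")
head(new_df)
```

Building code by pasting strings is fragile for column names that are not syntactic R names (for example names with spaces), so the as.name() and interp() route is the more robust of the two.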