Using dplyr within a function, non-standard evaluation

Tags: r, dplyr, nse

I'm trying to get my head around non-standard evaluation as used by dplyr, but without success. I'd like a short function that returns summary statistics (N, mean, sd, median, IQR, min, max) for a specified set of variables.

Simplified version of my function...

my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
    ## Summarise
    results <- summarise_(df,
                          n = ~n(),
                          mean = mean(~to.sum, na.rm = TRUE))
    return(results)
}

And running it with some dummy data...

set.seed(43290)
temp <- cbind(rnorm(n = 100, mean = 2, sd = 4),
              rnorm(n = 100, mean = 3, sd = 6)) %>% as.data.frame()
names(temp) <- c('eg1', 'eg2')
mean(temp$eg1)
  [1] 1.881721
mean(temp$eg2)
  [1] 3.575819
my_summarise(df = temp, to.sum = 'eg1')
    n mean
1 100   NA

N is calculated, but the mean is not, and I can't figure out why.

Ultimately I'd like my function to be more general, along the lines of...

my_summarise <- function(df = temp,
                         group.by = 'group',
                         to.sum = c('eg1', 'eg2'),
                         ...){
    results <- list()
    ## Select columns
    df <- dplyr::select_(df, .dots = c(group.by, to.sum))
    ## Summarise overall
    results$all <- summarise_each(df,
                                  funs(n = ~n(),
                                       mean = mean(~to.sum, na.rm = TRUE)))
    ## Summarise by specified group
    results$by.group <- group_by_(df, ~to.group) %>%
                        summarise_each(df,
                                       funs(n = ~n(),
                                       mean = mean(~to.sum, na.rm = TRUE)))        
    return(results)
}

...but before I move on to this more complex version (for which I was using this example as guidance), I need to get the evaluation working in the simple version first, as that's the stumbling block. The call to dplyr::select() works OK.

Appreciate any advice as to where I'm going wrong.

Thanks in advance

asked Oct 13 '16 09:10

slackline

1 Answer

The basic idea is that you have to actually build the appropriate call yourself, most easily done with the lazyeval package.
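
To see why the call has to be built rather than written inline, it may help to look at what R does with the original argument before summarise_ ever sees it. The following is a minimal base-R sketch (no dplyr needed); the data frame mirrors the question's dummy data, and the names `eager` and `lazy` are only illustrative.

```r
# Dummy data shaped like the question's example.
set.seed(43290)
temp <- data.frame(eg1 = rnorm(100, mean = 2, sd = 4),
                   eg2 = rnorm(100, mean = 3, sd = 6))

to.sum <- 'eg1'

# The original argument is evaluated eagerly: mean() receives a formula
# object, which is not numeric, so it warns and returns NA.
eager <- suppressWarnings(mean(~to.sum, na.rm = TRUE))
is.na(eager)  # TRUE

# A one-sided formula is left unevaluated; its right-hand side can be
# evaluated later, with the data frame supplying the column.
f <- ~mean(eg1, na.rm = TRUE)
lazy <- eval(f[[2]], temp, environment(f))
all.equal(lazy, mean(temp$eg1))  # TRUE
```

So the NA in the question is already computed before the summary is run, which is why the whole call needs to sit behind a `~`.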

In this case you want to programmatically create a call that looks like ~mean(eg1, na.rm = TRUE). This is how:

my_summarise <- function(df = temp,
                         to.sum = 'eg1',
                         ...){
  ## Summarise
  results <- summarise_(df,
                        n = ~n(),
                        mean = lazyeval::interp(~mean(x, na.rm = TRUE),
                                                x = as.name(to.sum)))
  return(results)
}
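
As an aside, the underscore verbs such as summarise_ have been deprecated in later dplyr releases. A rough equivalent in the newer tidy evaluation style, offered here as an alternative and not as part of the original answer, uses the `.data` pronoun to look up a column by its string name. The function name `my_summarise_tidy` is hypothetical, and the sketch assumes dplyr 0.7 or later.

```r
library(dplyr)

# Hypothetical tidy-eval variant of my_summarise(); assumes dplyr >= 0.7.
my_summarise_tidy <- function(df, to.sum = 'eg1') {
  summarise(df,
            n = n(),
            # .data[[to.sum]] looks up the column named by the string
            mean = mean(.data[[to.sum]], na.rm = TRUE))
}

# Dummy data shaped like the question's example.
set.seed(43290)
temp <- data.frame(eg1 = rnorm(100, mean = 2, sd = 4),
                   eg2 = rnorm(100, mean = 3, sd = 6))

res <- my_summarise_tidy(temp, to.sum = 'eg2')
res
```

The lazyeval version above remains the one to use if the code has to run on a dplyr release that predates tidy evaluation.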

Here is what I do when I struggle to get things working:

1. Remember that, just like the ~n() you already have, the call will have to start with a ~.
2. Write the correct call with the actual variable and see if it works (~mean(eg1, na.rm = TRUE)).
3. Use lazyeval::interp to recreate that call, and check this by running only the interp to visually see what it is doing.
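
The steps above can be sketched roughly as follows. This assumes the lazyeval package is installed; the names `hand`, `built`, and `same_call` are illustrative, and the formula is evaluated with base eval() rather than through dplyr so that each piece can be inspected on its own.

```r
# Dummy data shaped like the question's example.
set.seed(43290)
temp <- data.frame(eg1 = rnorm(100, mean = 2, sd = 4),
                   eg2 = rnorm(100, mean = 3, sd = 6))
to.sum <- 'eg1'

# Steps 1 and 2: write the call by hand, starting with ~ and using the
# real column name, then check that it gives the expected value.
hand <- ~mean(eg1, na.rm = TRUE)
hand_value <- eval(hand[[2]], temp, environment(hand))

# Step 3: rebuild the same call with interp() and look at it on its own.
built <- lazyeval::interp(~mean(x, na.rm = TRUE), x = as.name(to.sum))
print(built)

# The two right-hand sides should be the same expression.
same_call <- identical(deparse(built[[2]]), deparse(hand[[2]]))
same_call
```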

In this case my first attempt would probably be interp(~mean(x, na.rm = TRUE), x = to.sum). But running that gives ~mean("eg1", na.rm = TRUE), which treats eg1 as a character string instead of a variable name. So we use as.name, as described in vignette("nse").
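
A small comparison of the two variants, again assuming lazyeval is installed (`as_string` and `as_symbol` are illustrative names):

```r
to.sum <- 'eg1'

# Passing the string directly substitutes a character constant.
as_string <- lazyeval::interp(~mean(x, na.rm = TRUE), x = to.sum)

# Wrapping it in as.name() substitutes a symbol, i.e. a column reference.
as_symbol <- lazyeval::interp(~mean(x, na.rm = TRUE), x = as.name(to.sum))

deparse(as_string[[2]])  # the call with "eg1" in quotes
deparse(as_symbol[[2]])  # the call with the bare name eg1

# Inspect the first argument of each call to see the difference in type.
is.character(as_string[[2]][[2]])  # TRUE
is.name(as_symbol[[2]][[2]])       # TRUE
```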

answered Sep 20 '22 04:09

Axeman
