How to get quantiles to work with summarise_at and group_by (dplyr)

Tags: r, dplyr, quantile

When using dplyr to create a table of summary statistics organized by the levels of a variable, I cannot figure out the syntax for calculating quartiles without repeating the column name. That is, helpers such as vars() and list() work with other functions, such as mean() and median(), but not with quantile().

Searches have produced antiquated solutions that no longer work because they use deprecated calls, such as do() and/or funs().

data(iris)
library(tidyverse)

#This works: Notice I have not attempted to calculate quartiles yet
summary_stat <- iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(Sepal.Length), 
               list(min=min, median=median, max=max,
               mean=mean, sd=sd)
               )
# A tibble: 3 x 6
  Species      min median   max  mean    sd
  <fct>      <dbl>  <dbl> <dbl> <dbl> <dbl>
1 setosa       4.3    5     5.8  5.01 0.352
2 versicolor   4.9    5.9   7    5.94 0.516
3 virginica    4.9    6.5   7.9  6.59 0.636

##########################################################################
#Does NOT work:
five_number_summary <- iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(Sepal.Length),
               list(min=min, Q1=quantile(.,probs = 0.25),
                    median=median, Q3=quantile(., probs = 0.75),
                    max=max))

Error: Must use a vector in `[`, not an object of class matrix.
Call `rlang::last_error()` to see a backtrace

###########################################################################
#This works: Remove the vars() argument, remove the list() argument,
  #replace summarise_at() with summarise()
  #but the code requires repeating the column name (Sepal.Length)

five_number_summary <- iris %>% 
  group_by(Species) %>% 
  summarise(min=min(Sepal.Length), 
            Q1=quantile(Sepal.Length,probs = 0.25),
            median=median(Sepal.Length), 
            Q3=quantile(Sepal.Length, probs = 0.75),
            max=max(Sepal.Length))

# A tibble: 3 x 6
  Species      min    Q1 median    Q3   max
  <fct>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1 setosa       4.3  4.8     5     5.2   5.8
2 versicolor   4.9  5.6     5.9   6.3   7  
3 virginica    4.9  6.22    6.5   6.9   7.9

This last piece of code produces exactly what I am looking for, but I am wondering why there isn't a shorter syntax that doesn't force me to repeat the variable.

asked Sep 17 '19 13:09

James

2 Answers

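You're missing the ~ in front of the quantile() calls in the summarise_at() call that failed. Without it, quantile(., probs = 0.25) is evaluated immediately while the list is being built, instead of being passed along as a function to apply to each group. Try the following: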

five_number_summary <- iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(Sepal.Length),
               list(min=min, Q1=~quantile(., probs = 0.25),
                    median=median, Q3=~quantile(., probs = 0.75),
                    max=max))
five_number_summary
# A tibble: 3 x 6
  Species      min    Q1 median    Q3   max
  <fct>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1 setosa       4.3  4.8     5     5.2   5.8
2 versicolor   4.9  5.6     5.9   6.3   7  
3 virginica    4.9  6.22    6.5   6.9   7.9
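For readers on dplyr 1.0.0 or later, the same idea can be sketched with across(), which supersedes summarise_at(). The version requirement is an assumption about your installation and is not part of the original answer; the `.names = "{.fn}"` specification is used here only so the output columns match the ones above.

```r
library(dplyr)

# Sketch assuming dplyr >= 1.0.0, where across() is available.
# Bare function names and ~ lambdas can be mixed in the same named list.
five_number_summary <- iris %>%
  group_by(Species) %>%
  summarise(across(
    Sepal.Length,
    list(
      min    = min,
      Q1     = ~ quantile(.x, probs = 0.25),
      median = median,
      Q3     = ~ quantile(.x, probs = 0.75),
      max    = max
    ),
    .names = "{.fn}"  # name each column after its list entry only
  ))

five_number_summary
```

Without `.names`, across() would prefix each column with the variable name (for example Sepal.Length_Q1), which is useful when summarising several columns at once.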

answered Sep 27 '22 22:09

Arienrhod

You can create a list column and then use unnest_wider, which requires tidyr 1.0.0 or later:

library(tidyverse)

iris %>% 
  group_by(Species) %>% 
  summarise(q = list(quantile(Sepal.Length))) %>% 
  unnest_wider(q)

# # A tibble: 3 x 6
#   Species     `0%` `25%` `50%` `75%` `100%`
#   <fct>      <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1 setosa       4.3  4.8    5     5.2    5.8
# 2 versicolor   4.9  5.6    5.9   6.3    7  
# 3 virginica    4.9  6.22   6.5   6.9    7.9

There's a names_repair argument, but apparently it is applied to every column name in the result, not just the ones being unnested, so Species gets renamed too:

iris %>% 
  group_by(Species) %>% 
  summarise(q = list(quantile(Sepal.Length))) %>% 
  unnest_wider(q, names_repair = ~paste0('Q_', sub('%', '', .)))

# # A tibble: 3 x 6
#   Q_Species    Q_0  Q_25  Q_50  Q_75 Q_100
#   <fct>      <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa       4.3  4.8    5     5.2   5.8
# 2 versicolor   4.9  5.6    5.9   6.3   7  
# 3 virginica    4.9  6.22   6.5   6.9   7.9
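One way to sidestep the renaming problem is to set the names on the quantile vector itself before it goes into the list column, so that unnest_wider only ever sees the names you want. This is a minimal sketch; the labels in `q_names` are hypothetical and chosen to mirror the column names in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical labels for the five default quantiles (0%, 25%, 50%, 75%, 100%)
q_names <- c("min", "Q1", "median", "Q3", "max")

five_number_summary <- iris %>%
  group_by(Species) %>%
  # Rename the quantile vector before wrapping it in a list column
  summarise(q = list(setNames(quantile(Sepal.Length), q_names))) %>%
  unnest_wider(q)

five_number_summary
```

Because the grouping column is never touched by setNames(), Species keeps its original name.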

Another option is group_modify:

iris %>% 
  group_by(Species) %>% 
  group_modify(~as.data.frame(t(quantile(.$Sepal.Length))))

# # A tibble: 3 x 6
# # Groups:   Species [3]
#   Species     `0%` `25%` `50%` `75%` `100%`
#   <fct>      <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1 setosa       4.3  4.8    5     5.2    5.8
# 2 versicolor   4.9  5.6    5.9   6.3    7  
# 3 virginica    4.9  6.22   6.5   6.9    7.9
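In the same spirit, and assuming dplyr 1.0.0 or later, summarise() itself can unpack a one-row data frame returned by an unnamed expression. The sketch below wraps the five statistics in a helper; `five_num` is a hypothetical name introduced for illustration, not a dplyr function.

```r
library(dplyr)

# Hypothetical helper: returns a one-row tibble of five summary statistics.
five_num <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  tibble(
    min    = min(x),
    Q1     = q[1],
    median = median(x),
    Q3     = q[2],
    max    = max(x)
  )
}

five_number_summary <- iris %>%
  group_by(Species) %>%
  # An unnamed data-frame result is unpacked into separate columns
  summarise(five_num(Sepal.Length))

five_number_summary
```

The column name appears only once, in the call to the helper, and the helper can be reused for any other numeric column.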

Or you could use data.table:

library(data.table)
irisdt <- as.data.table(iris)

irisdt[, as.list(quantile(Sepal.Length)), Species]
#       Species  0%   25% 50% 75% 100%
# 1:     setosa 4.3 4.800 5.0 5.2  5.8
# 2: versicolor 4.9 5.600 5.9 6.3  7.0
# 3:  virginica 4.9 6.225 6.5 6.9  7.9

answered Sep 27 '22 22:09

IceCreamToucan
