When using dplyr
to create a table of summary statistics that is organized by levels of a variable, I cannot figure out the syntax for calculating quartiles without having to repeat the column name. That is, using calls, such as vars()
and list()
work with other functions, such as mean()
and median()
but not with quantile()
Searches have produced antiquated solutions that no longer work because they use deprecated calls, such as do()
and/or funs()
.
data(iris)
library(tidyverse)
#This works: Notice I have not attempted to calculate quartiles yet
summary_stat <- iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length),
list(min=min, median=median, max=max,
mean=mean, sd=sd)
)
A tibble: 3 x 6
Species min median max mean sd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 5 5.8 5.01 0.352
2 versicolor 4.9 5.9 7 5.94 0.516
3 virginica 4.9 6.5 7.9 6.59 0.636
##########################################################################
#Does NOT work:
five_number_summary <- iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length),
list(min=min, Q1=quantile(.,probs = 0.25),
median=median, Q3=quantile(., probs = 0.75),
max=max))
Error: Must use a vector in `[`, not an object of class matrix.
Call `rlang::last_error()` to see a backtrace
###########################################################################
#This works: Remove the vars() argument, remove the list() argument,
#replace summarise_at() with summarise()
#but the code requires repeating the column name (Sepal.Length)
five_number_summary <- iris %>%
group_by(Species) %>%
summarise(min=min(Sepal.Length),
Q1=quantile(Sepal.Length,probs = 0.25),
median=median(Sepal.Length),
Q3=quantile(Sepal.Length, probs = 0.75),
max=max(Sepal.Length))
# A tibble: 3 x 6
Species min Q1 median Q3 max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 4.8 5 5.2 5.8
2 versicolor 4.9 5.6 5.9 6.3 7
3 virginica 4.9 6.22 6.5 6.9 7.9
This last piece of code produces exactly what I am looking for, but I am wondering why there isn't a shorter syntax that doesn't force me to repeat the variable.
To group data, we use dplyr module. This module contains a function called group_by() in which the column to be grouped by has to be passed. To find quantiles of the grouped data we will call summarize method with quantiles() function.
A quartile is a type of quantile. Quantiles are values that split sorted data or a probability distribution into equal parts. In general terms, a q-quantile divides sorted data into q parts.
Quantiles are points in a distribution that relate to the rank order of values in that distribution. For a sample, you can find any quantile by sorting the sample. The middle value of the sorted sample (middle quantile, 50th percentile) is known as the median.
In statistics, quantiles are values that divide a ranked dataset into equal groups. The quantile() function in R can be used to calculate sample quantiles of a dataset. This function uses the following basic syntax: quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE)
You're missing the ~
in front of the quantile
function in the summarise_at
call that failed. Try the following:
five_number_summary <- iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length),
list(min=min, Q1=~quantile(., probs = 0.25),
median=median, Q3=~quantile(., probs = 0.75),
max=max))
five_number_summary
# A tibble: 3 x 6
Species min Q1 median Q3 max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 4.8 5 5.2 5.8
2 versicolor 4.9 5.6 5.9 6.3 7
3 virginica 4.9 6.22 6.5 6.9 7.9
You can create a list column and then use unnest_wider
, which requires tidyr 1.0.0
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(q = list(quantile(Sepal.Length))) %>%
unnest_wider(q)
# # A tibble: 3 x 6
# Species `0%` `25%` `50%` `75%` `100%`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 4.3 4.8 5 5.2 5.8
# 2 versicolor 4.9 5.6 5.9 6.3 7
# 3 virginica 4.9 6.22 6.5 6.9 7.9
There's a names_repair
argument, but apparently that changes the name of all the columns, and not just the ones being unnested (??)
iris %>%
group_by(Species) %>%
summarise(q = list(quantile(Sepal.Length))) %>%
unnest_wider(q, names_repair = ~paste0('Q_', sub('%', '', .)))
# # A tibble: 3 x 6
# Q_Species Q_0 Q_25 Q_50 Q_75 Q_100
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 4.3 4.8 5 5.2 5.8
# 2 versicolor 4.9 5.6 5.9 6.3 7
# 3 virginica 4.9 6.22 6.5 6.9 7.9
Another option is group_modify
iris %>%
group_by(Species) %>%
group_modify(~as.data.frame(t(quantile(.$Sepal.Length))))
# # A tibble: 3 x 6
# # Groups: Species [3]
# Species `0%` `25%` `50%` `75%` `100%`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 4.3 4.8 5 5.2 5.8
# 2 versicolor 4.9 5.6 5.9 6.3 7
# 3 virginica 4.9 6.22 6.5 6.9 7.9
Or you could use data.table
library(data.table)
irisdt <- as.data.table(iris)
irisdt[, as.list(quantile(Sepal.Length)), Species]
# Species 0% 25% 50% 75% 100%
# 1: setosa 4.3 4.800 5.0 5.2 5.8
# 2: versicolor 4.9 5.600 5.9 6.3 7.0
# 3: virginica 4.9 6.225 6.5 6.9 7.9
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With