Using dplyr window functions to calculate percentiles

Tags: r, dplyr, tidyr

I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.

Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:

    library(dplyr)
    library(tidyr)

    # load data
    data("mtcars")

    # Percentiles used in calculation
    p <- c(.25,.5,.75)

    # old dplyr solution
    mtcars %>% group_by(cyl) %>%
      do(data.frame(p=p, stats=quantile(.$mpg, probs=p),
                    n = length(.$mpg), avg = mean(.$mpg))) %>%
      spread(p, stats) %>%
      select(1, 4:6, 3, 2)

    # note: the select and spread statements are just to get the data into
    #       the format in which I'd like to see it, but are not critical

Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By cleanly, I mean without the "do" statement.

Thank you

647

asked May 27 '15 16:05

dreww2

2 Answers

In dplyr 1.0, summarise can return multiple values per group, allowing the following:

    library(tidyverse)

    mtcars %>%
      group_by(cyl) %>%
      summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)),
                mpg = quantile(mpg, c(0.25, 0.5, 0.75)))

Or, you can avoid a separate line to name the quantiles by going with enframe:

    mtcars %>%
      group_by(cyl) %>%
      summarise(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))

        cyl quantile   mpg
      <dbl> <chr>    <dbl>
    1     4 25%       22.8
    2     4 50%       26
    3     4 75%       30.4
    4     6 25%       18.6
    5     6 50%       19.7
    6     6 75%       21
    7     8 25%       14.4
    8     8 50%       15.2
    9     8 75%       16.2
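One caveat for newer installations: as of dplyr 1.1.0, returning more than one row per group from summarise is deprecated in favor of reframe(). A minimal sketch of the same enframe approach using reframe(), assuming dplyr >= 1.1.0 (the names `probs` and `res` are only for illustration):

```r
library(dplyr)
library(tibble)

# Percentiles to compute (illustrative name)
probs <- c(0.25, 0.5, 0.75)

# reframe() allows any number of rows per group and returns an
# ungrouped data frame; the unnamed enframe() result is unpacked
# into the "quantile" and "mpg" columns.
res <- mtcars %>%
  group_by(cyl) %>%
  reframe(enframe(quantile(mpg, probs), "quantile", "mpg"))

res
```

The result should have the same shape as the output above: one row per cyl/quantile pair.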

Answer for previous versions of dplyr

    library(tidyverse)

    mtcars %>%
      group_by(cyl) %>%
      summarise(x=list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)), "quantiles", "mpg"))) %>%
      unnest(x)

        cyl quantiles   mpg
    1     4       25% 22.80
    2     4       50% 26.00
    3     4       75% 30.40
    4     6       25% 18.65
    5     6       50% 19.70
    6     6       75% 21.00
    7     8       25% 14.40
    8     8       50% 15.20
    9     8       75% 16.25

This can be turned into a more general function using tidyeval:

    q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) {
      groups=enquos(...)

      data %>%
        group_by(!!!groups) %>%
        summarise(x = list(enframe(quantile({{value.col}}, probs=probs), "quantiles", "mpg"))) %>%
        unnest(x)
    }

    q_by_group(mtcars, mpg)
    q_by_group(mtcars, mpg, cyl)
    q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75))
    q_by_group(iris, Petal.Width, Species)
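Note that the function above always labels the value column "mpg", even in the iris call. A possible variant, sketched here under the assumption that rlang is installed (it is a dplyr dependency), names the value column after whatever column is passed in. The names `q_by_group2`, `value.name` and `iris_q` are hypothetical and not part of the original answer:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical variant of q_by_group(): the value column is named after
# the column supplied as value.col instead of always being "mpg".
q_by_group2 <- function(data, value.col, ..., probs = seq(0, 1, 0.25)) {
  # Turn the unquoted column into a string, e.g. "Petal.Width"
  value.name <- rlang::as_label(rlang::enquo(value.col))

  data %>%
    group_by(...) %>%
    summarise(x = list(enframe(quantile({{ value.col }}, probs = probs),
                               "quantiles", value.name))) %>%
    unnest(x)
}

iris_q <- q_by_group2(iris, Petal.Width, Species)
iris_q
```

Passing the dots straight to group_by() is a small simplification of the enquos/`!!!` pattern and should behave the same for plain column names.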

194

answered Sep 25 '22 08:09

eipi10

If you're up for using purrr::map, you can do it like this!

    library(tidyverse)

    mtcars %>%
      tbl_df() %>%
      nest(-cyl) %>%
      mutate(Quantiles = map(data, ~ quantile(.$mpg)),
             Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>%
      unnest(Quantiles)

    #> # A tibble: 15 x 3
    #>      cyl key   value
    #>    <dbl> <chr> <dbl>
    #>  1     6 0%     17.8
    #>  2     6 25%    18.6
    #>  3     6 50%    19.7
    #>  4     6 75%    21
    #>  5     6 100%   21.4
    #>  6     4 0%     21.4
    #>  7     4 25%    22.8
    #>  8     4 50%    26
    #>  9     4 75%    30.4
    #> 10     4 100%   33.9
    #> 11     8 0%     10.4
    #> 12     8 25%    14.4
    #> 13     8 50%    15.2
    #> 14     8 75%    16.2
    #> 15     8 100%   19.2

Created on 2018-11-10 by the reprex package (v0.2.1)

One nice thing about this approach is the output is tidy, one observation per row.
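If the wide layout from the question is wanted instead (quartiles as columns, plus the mean and count), tidy output of this kind can be widened again with pivot_wider. A rough sketch, assuming tidyr >= 1.0.0 for pivot_wider; it uses a list column plus unnest rather than purrr, and the names `probs`, `long` and `wide` are illustrative only:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Percentiles to compute (illustrative name)
probs <- c(0.25, 0.5, 0.75)

# Long (tidy) form: one row per cyl/quantile pair, with the per-group
# mean and count repeated on each row.
long <- mtcars %>%
  group_by(cyl) %>%
  summarise(x = list(enframe(quantile(mpg, probs), "quantile", "mpg")),
            avg = mean(mpg),
            n = n()) %>%
  unnest(x)

# Wide form: one row per cyl, one column per quantile.
wide <- long %>%
  pivot_wider(names_from = quantile, values_from = mpg) %>%
  select(cyl, `25%`, `50%`, `75%`, avg, n)

wide
```

Keeping the long form around is still handy for plotting or further grouping; the wide form is mainly for display.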

answered Sep 25 '22 08:09

Julia Silge
