
Using dplyr window functions to calculate percentiles

Tags: r, dplyr, tidyr

I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.

Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:

    library(dplyr)
    library(tidyr)

    # load data
    data("mtcars")

    # Percentiles used in calculation
    p <- c(.25,.5,.75)

    # old dplyr solution
    mtcars %>% group_by(cyl) %>%
      do(data.frame(p=p, stats=quantile(.$mpg, probs=p),
                    n = length(.$mpg), avg = mean(.$mpg))) %>%
      spread(p, stats) %>%
      select(1, 4:6, 3, 2)

    # note: the select and spread statements are just to get the data into
    #       the format in which I'd like to see it, but are not critical

Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By cleanly, I mean without the "do" statement.

Thank you

dreww2 asked May 27 '15 16:05


2 Answers

In dplyr 1.0.0 and later, summarise() can return multiple rows per group (since dplyr 1.1.0 this use is deprecated in favour of reframe()), allowing the following:

    library(tidyverse)

    mtcars %>%
      group_by(cyl) %>%
      summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)),
                mpg = quantile(mpg, c(0.25, 0.5, 0.75)))

Or, you can avoid a separate line to name the quantiles by going with enframe:

    mtcars %>%
      group_by(cyl) %>%
      summarise(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))

        cyl quantile   mpg
      <dbl> <chr>    <dbl>
    1     4 25%       22.8
    2     4 50%       26
    3     4 75%       30.4
    4     6 25%       18.6
    5     6 50%       19.7
    6     6 75%       21
    7     8 25%       14.4
    8     8 50%       15.2
    9     8 75%       16.2
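For readers on dplyr 1.1.0 or later, the same result can be sketched with reframe(), which is meant for summaries that return any number of rows per group. This is a minimal sketch, not part of the original answer; the names `probs` and `mpg_quantiles` are introduced here for illustration only.

```r
library(dplyr)
library(tibble)

# Percentiles to compute (illustrative name, not from the original answer)
probs <- c(0.25, 0.5, 0.75)

# reframe() (assumes dplyr >= 1.1.0) may return several rows per group
# and always returns an ungrouped data frame
mpg_quantiles <- mtcars %>%
  group_by(cyl) %>%
  reframe(enframe(quantile(mpg, probs), "quantile", "mpg"))

mpg_quantiles
```

The output should have the same nine rows as the summarise() version above, without the deprecation warning that newer dplyr versions emit for multi-row summarise() results.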

Answer for previous versions of dplyr

    library(tidyverse)

    mtcars %>%
      group_by(cyl) %>%
      summarise(x=list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)), "quantiles", "mpg"))) %>%
      unnest(x)

        cyl quantiles   mpg
    1     4       25% 22.80
    2     4       50% 26.00
    3     4       75% 30.40
    4     6       25% 18.65
    5     6       50% 19.70
    6     6       75% 21.00
    7     8       25% 14.40
    8     8       50% 15.20
    9     8       75% 16.25
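If the goal is the wide layout from the question (one row per cyl with the three percentiles, the mean, and the count), a plain summarise() with one quantile() call per column may be enough, and it needs neither do() nor multi-row results. This is a minimal sketch under that assumption; the column names and `mpg_summary` are chosen here for illustration.

```r
library(dplyr)

# One row per cyl: three percentiles of mpg, plus mean and group size
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    `25%` = quantile(mpg, 0.25),
    `50%` = quantile(mpg, 0.50),
    `75%` = quantile(mpg, 0.75),
    avg   = mean(mpg),
    n     = n()
  )

mpg_summary
```

The trade-off is that each percentile is written out by hand, so this suits a short, fixed list of percentiles better than an arbitrary `probs` vector.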

This can be turned into a more general function using tidyeval:

    q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) {
      groups = enquos(...)
      # Label the value column after the column passed in, not a hard-coded "mpg"
      value.name = rlang::as_label(enquo(value.col))

      data %>%
        group_by(!!!groups) %>%
        summarise(x = list(enframe(quantile({{value.col}}, probs=probs), "quantiles", value.name))) %>%
        unnest(x)
    }

    q_by_group(mtcars, mpg)
    q_by_group(mtcars, mpg, cyl)
    q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75))
    q_by_group(iris, Petal.Width, Species)
eipi10 answered Sep 25 '22 08:09


If you're up for using purrr::map, you can do it like this!

    library(tidyverse)

    mtcars %>%
      tbl_df() %>%
      nest(-cyl) %>%
      mutate(Quantiles = map(data, ~ quantile(.$mpg)),
             Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>%
      unnest(Quantiles)

    #> # A tibble: 15 x 3
    #>      cyl key   value
    #>    <dbl> <chr> <dbl>
    #>  1     6 0%     17.8
    #>  2     6 25%    18.6
    #>  3     6 50%    19.7
    #>  4     6 75%    21
    #>  5     6 100%   21.4
    #>  6     4 0%     21.4
    #>  7     4 25%    22.8
    #>  8     4 50%    26
    #>  9     4 75%    30.4
    #> 10     4 100%   33.9
    #> 11     8 0%     10.4
    #> 12     8 25%    14.4
    #> 13     8 50%    15.2
    #> 14     8 75%    16.2
    #> 15     8 100%   19.2

Created on 2018-11-10 by the reprex package (v0.2.1)

One nice thing about this approach is that the output is tidy: one observation per row.

Julia Silge answered Sep 25 '22 08:09