I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.
Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:
library(dplyr) library(tidyr) # load data data("mtcars") # Percentiles used in calculation p <- c(.25,.5,.75) # old dplyr solution mtcars %>% group_by(cyl) %>% do(data.frame(p=p, stats=quantile(.$mpg, probs=p), n = length(.$mpg), avg = mean(.$mpg))) %>% spread(p, stats) %>% select(1, 4:6, 3, 2) # note: the select and spread statements are just to get the data into # the format in which I'd like to see it, but are not critical
Is there a way I can do this more cleanly with dplyr using some of the summary functions (n_tiles, percent_rank, etc.)? By cleanly, I mean without the "do" statement.
Thank you
You can use 'percent_rank' function to get the percentile calculation. In Exploratory, you can simply select 'Create Window Calculation' -> 'Rank' -> 'Percent Rank' from the menu of 'numbers_per_k' column in this case. Once you run it, the calculation is done for each row.
You find a percentile in R by using the quantiles function. It produces the percentage with the value that is the percentile. This is the default version of this function, and it produces the 0th percentile, 25th percentile, 50th percentile, 75th percentile, and 100th percentile.
Data Visualization using R Programming To find the 2.5th percentile, we would need to use the probability = 0.025 and for the 97.5th percentile we can use probability = 0.0975.
In dplyr 1.0
, summarise
can return multiple values, allowing the following:
library(tidyverse) mtcars %>% group_by(cyl) %>% summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)), mpg = quantile(mpg, c(0.25, 0.5, 0.75)))
Or, you can avoid a separate line to name the quantiles by going with enframe
:
mtcars %>% group_by(cyl) %>% summarise(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))
cyl quantile mpg <dbl> <chr> <dbl> 1 4 25% 22.8 2 4 50% 26 3 4 75% 30.4 4 6 25% 18.6 5 6 50% 19.7 6 6 75% 21 7 8 25% 14.4 8 8 50% 15.2 9 8 75% 16.2
Answer for previous versions of dplyr
library(tidyverse) mtcars %>% group_by(cyl) %>% summarise(x=list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)), "quantiles", "mpg"))) %>% unnest(x)
cyl quantiles mpg 1 4 25% 22.80 2 4 50% 26.00 3 4 75% 30.40 4 6 25% 18.65 5 6 50% 19.70 6 6 75% 21.00 7 8 25% 14.40 8 8 50% 15.20 9 8 75% 16.25
This can be turned into a more general function using tidyeval:
q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) { groups=enquos(...) data %>% group_by(!!!groups) %>% summarise(x = list(enframe(quantile({{value.col}}, probs=probs), "quantiles", "mpg"))) %>% unnest(x) } q_by_group(mtcars, mpg) q_by_group(mtcars, mpg, cyl) q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75)) q_by_group(iris, Petal.Width, Species)
If you're up for using purrr::map
, you can do it like this!
library(tidyverse) mtcars %>% tbl_df() %>% nest(-cyl) %>% mutate(Quantiles = map(data, ~ quantile(.$mpg)), Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>% unnest(Quantiles) #> # A tibble: 15 x 3 #> cyl key value #> <dbl> <chr> <dbl> #> 1 6 0% 17.8 #> 2 6 25% 18.6 #> 3 6 50% 19.7 #> 4 6 75% 21 #> 5 6 100% 21.4 #> 6 4 0% 21.4 #> 7 4 25% 22.8 #> 8 4 50% 26 #> 9 4 75% 30.4 #> 10 4 100% 33.9 #> 11 8 0% 10.4 #> 12 8 25% 14.4 #> 13 8 50% 15.2 #> 14 8 75% 16.2 #> 15 8 100% 19.2
Created on 2018-11-10 by the reprex package (v0.2.1)
One nice thing about this approach is the output is tidy, one observation per row.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With