how can I do this calculation:
library(plyr)
quantile(baseball$ab)
  0%  25%  50%  75% 100%
   0   25  131  435  705
by groups, say by "team"? I want a data.frame with rownames "team" and column names "0% 25% 50% 75% 100%", i.e. one quantile
call per group.
doing
ddply(baseball,"team",quantile(ab))
is not the correct solution. my problem is that the OUTPUT of each grouped operation is a vector of length 5 here.
in other words, what's a neat solution to this (nevermind the header):
m=data.frame()
for (i in unique(baseball$team)){m=rbind(m,quantile(baseball[baseball$team==i, ]$ab))}
head(m,3)
  X120 X120.1 X120.2 X120.3 X120.4
1  120  120.0  120.0 120.00    120
2  162  162.0  162.0 162.00    162
3   89   89.0   89.0  89.00     89
To group data, use the dplyr package. Its group_by() function takes the column to group by. To find quantiles of the grouped data, call summarise() with the quantile() function, asking for one probability per summary column.
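As a rough illustration of the group_by() and summarise() pattern just described, here is a minimal sketch. It assumes dplyr is installed and uses the built-in mtcars data (grouping by cyl, summarising wt) in place of the baseball data; the column names q25, q50 and q75 are made up for the example.

```r
library(dplyr)

# mtcars ships with R; grouping by cyl stands in for grouping by team
wt_quantiles <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    q25 = quantile(wt, probs = 0.25),  # lower quartile of wt within each group
    q50 = quantile(wt, probs = 0.50),  # median
    q75 = quantile(wt, probs = 0.75)   # upper quartile
  )

print(wt_quantiles)
```

Each quantile() call here returns a single value, so summarise() yields one row per group and one column per requested quantile.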
Three quartile values (the lower quartile, the median, and the upper quartile) divide a data set into four ranges, each containing 25% of the data points. The lower quartile, or first quartile, is denoted Q1 and is the middle value between the smallest value of the data set and the median.
Percentiles are given as percent values, values such as 95%, 40%, or 27%. Quantiles are given as decimal values, values such as 0.95, 0.4, and 0.27. The 0.95 quantile point is exactly the same as the 95th percentile point.
A quartile is a type of quantile. Quantiles are values that split sorted data or a probability distribution into equal parts. In general terms, a q-quantile divides sorted data into q parts.
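To make the terminology concrete, here is a small base R sketch. The vector x (the integers 1 to 100) is an arbitrary example, not data from the question. It shows that quantile() takes probabilities as decimals and labels its results as percentages.

```r
# Illustrative data only: the integers 1 to 100
x <- 1:100

# Quartiles are the quantiles at probabilities 0.25, 0.5 and 0.75
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# The 0.95 quantile is the 95th percentile; R names the result "95%"
p95 <- quantile(x, probs = 0.95)

print(quartiles)
print(names(p95))

# The 0.5 quantile coincides with the median
print(all.equal(unname(quartiles["50%"]), median(x)))
```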
With base R you could use tapply and do.call:
library(plyr)
do.call("rbind", tapply(baseball$ab, baseball$team, quantile))
do.call("rbind", tapply(baseball$ab, baseball$team, quantile, c(0.05, 0.1, 0.2)))
Or, with ddply
ddply(baseball, .(team), function(x) quantile(x$ab))
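The question asks for a data.frame with the groups as row names and the quantile labels as column names. The tapply and do.call combination returns a matrix, so one extra conversion step is needed. The sketch below shows that step on the built-in mtcars data (wt grouped by cyl) rather than baseball, so that it runs without plyr; the object names are placeholders.

```r
# Built-in data for illustration: car weight (wt) grouped by cylinder count (cyl)
per_group <- tapply(mtcars$wt, mtcars$cyl, quantile)  # one quantile vector per group

# Stack the per-group vectors into a matrix, one row per group
q_matrix <- do.call(rbind, per_group)

# Convert to a data.frame; row names are the groups, columns are "0%", "25%", ...
q_df <- as.data.frame(q_matrix)

print(q_df)
```

Building the whole matrix in one rbind call also avoids growing a data.frame row by row inside a loop.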
A slightly different approach using dplyr:
library(tidyverse)
baseball %>%
  group_by(team) %>%
  nest() %>%
  mutate(
    ret = map(data, ~quantile(.$ab, probs = c(0.25, 0.75))),
    ret = invoke_map(tibble, ret)
  ) %>%
  unnest(ret)
Here you can specify the needed quantiles in the probs argument.
The invoke_map call seems to be necessary, as quantile does not return a data frame; see this answer.
You can also put that all into a function:
get_quantiles <- function(.data, .var, .probs = c(0.25, 0.75), .group_vars = vars()) {
  .var = deparse(substitute(.var))
  return(
    .data %>%
      group_by_at(.group_vars) %>%
      nest() %>%
      mutate(
        ret = map(data, ~quantile(.[[.var]], probs = .probs)),
        ret = invoke_map(tibble, ret)
      ) %>%
      unnest(ret, .drop = TRUE)
  )
}
mtcars %>% get_quantiles(wt, .group_vars = vars(cyl))
A new approach would be to use group_modify() from dplyr. Then you'd call:
baseball %>%
  group_by(team) %>%
  group_modify(~{
    quantile(.x$ab, probs = c(0.25, 0.75)) %>%
      tibble::enframe()
  }) %>%
  spread(name, value)
You should define the calculation for each quantile separately and use summarise. Also use .(team).
library(plyr)
data(baseball)
ddply(baseball, .(team), summarise,
      X0   = quantile(ab, probs = 0),
      X25  = quantile(ab, probs = 0.25),
      X50  = quantile(ab, probs = 0.50),
      X75  = quantile(ab, probs = 0.75),
      X100 = quantile(ab, probs = 1))