
ddply multiple quantiles by group

Tags:

r

plyr

How can I do this calculation:

library(plyr)
quantile(baseball$ab)
  0%  25%  50%  75% 100% 
  0   25  131  435  705 

by group, say by "team"? I want a data.frame with the teams as row names and "0% 25% 50% 75% 100%" as column names, i.e. one quantile call per group.

Doing

ddply(baseball,"team",quantile(ab))

is not the correct solution. My problem is that the output of each grouped operation is a vector of length 5 here.

In other words, what's a neat solution to this (never mind the header):

m=data.frame()
for (i in unique(baseball$team)){m=rbind(m,quantile(baseball[baseball$team==i, ]$ab))}
head(m,3)
  X120 X120.1 X120.2 X120.3 X120.4
1  120  120.0  120.0 120.00    120
2  162  162.0  162.0 162.00    162
3   89   89.0   89.0  89.00     89
Florian Oswald asked Mar 14 '14 11:03



3 Answers

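With base R you could use tapply and do.call (plyr is loaded here only for the baseball data set):

library(plyr)
# default quantiles (0%, 25%, 50%, 75%, 100%), one row per team
do.call("rbind", tapply(baseball$ab, baseball$team, quantile))

# custom quantiles: extra arguments to tapply are passed on to quantile
do.call("rbind", tapply(baseball$ab, baseball$team, quantile, probs = c(0.05, 0.1, 0.2)))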

Or, with ddply

ddply(baseball, .(team), function(x) quantile(x$ab))
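
To see what the tapply/do.call combination produces without loading plyr, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only, not part of the baseball data.

```r
# Hypothetical toy data standing in for baseball.
toy <- data.frame(
  team = c("A", "A", "A", "A", "B", "B", "B", "B"),
  ab   = c(0, 10, 20, 30, 100, 200, 300, 400)
)

# tapply() applies quantile() to the ab values of each team and
# returns one named quantile vector per team.
per_team <- tapply(toy$ab, toy$team, quantile)

# rbind() stacks those vectors: one row per team, one column per quantile.
q_mat <- do.call(rbind, per_team)
print(q_mat)

# Convert the matrix to a data.frame if that is the shape you need.
q_df <- as.data.frame(q_mat)
```

The result is a matrix whose row names are the teams, which is the layout asked for in the question.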
Patrick Hausmann answered Oct 11 '22 18:10


A slightly different approach using dplyr (plus tidyr and purrr, all loaded by tidyverse):

library(tidyverse)

baseball %>% 
  group_by(team) %>% 
  nest() %>% 
  mutate(
    ret = map(data, ~quantile(.$ab, probs = c(0.25, 0.75))),
    ret = invoke_map(tibble, ret)
  ) %>%
  unnest(ret)

Here you can specify the needed quantiles in the probs argument.

The invoke_map call seems to be necessary, as quantile returns a named numeric vector rather than a data frame; see this answer.
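
To illustrate the point about the return type, here is a small base R sketch (not the tidyverse code above) showing what quantile gives back and one way to turn it into a single-row data frame. The input values and the names q and row_df are made up for the example.

```r
# quantile() returns a named numeric vector, not a data frame.
q <- quantile(c(1, 2, 3, 4, 5), probs = c(0.25, 0.75))
print(is.data.frame(q))
print(names(q))

# t() turns the vector into a 1-row matrix with the quantile labels as
# column names; as.data.frame() then gives one column per quantile.
row_df <- as.data.frame(t(q))
print(row_df)
```

This per-group conversion into a one-row table is what lets the results be unnested into regular columns.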

You can also put that all into a function:

get_quantiles <- function(.data, .var, .probs = c(0.25, 0.75), .group_vars = vars()) {
  # capture the unquoted column name as a string
  .var <- deparse(substitute(.var))
  return(
    .data %>% 
    group_by_at(.group_vars) %>% 
    nest() %>% 
    mutate(
      # one named quantile vector per group, then one tibble row per group
      ret = map(data, ~quantile(.[[.var]], probs = .probs)),
      ret = invoke_map(tibble, ret)
    ) %>%
    unnest(ret, .drop = TRUE)
  )
}

mtcars %>% get_quantiles(wt, .group_vars = vars(cyl))

A newer approach is to use group_modify() from dplyr. Then you'd call:

baseball %>%
  group_by(team) %>% 
  group_modify(~{
    quantile(.x$ab, probs = c(0.25, 0.75)) %>% 
    tibble::enframe()
  }) %>%
  spread(name, value)
slhck answered Oct 11 '22 19:10


You should define the calculation for each quantile separately and use summarise. Also use .(team).

library(plyr)
data(baseball)
ddply(baseball, .(team), summarise,
      X0   = quantile(ab, probs = 0),
      X25  = quantile(ab, probs = 0.25),
      X50  = quantile(ab, probs = 0.50),
      X75  = quantile(ab, probs = 0.75),
      X100 = quantile(ab, probs = 1))
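
The same one-column-per-quantile idea can be sketched in base R, which also avoids repeating the quantile call by hand. This is only an illustration on made-up data; toy, probs, cols and res are hypothetical names, and the column names mirror the ones used above.

```r
# Hypothetical toy data standing in for baseball.
toy <- data.frame(
  team = c("A", "A", "A", "A", "B", "B", "B", "B"),
  ab   = c(0, 10, 20, 30, 100, 200, 300, 400)
)

# One named entry per output column, holding the probability to compute.
probs <- c(X0 = 0, X25 = 0.25, X50 = 0.5, X75 = 0.75, X100 = 1)

# For each probability, compute a single quantile per team. Each call
# returns one value per group, so each result becomes one column.
cols <- lapply(probs, function(p) {
  as.vector(tapply(toy$ab, toy$team, quantile, probs = p))
})

# tapply() orders groups by sorted team name, so build the key to match.
res <- data.frame(team = sort(unique(toy$team)), cols)
print(res)
```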
Mikko answered Oct 11 '22 18:10