
Quantiles by factor levels in R

Tags:

r

I have a data frame and I'm trying to create a new variable in the data frame that has the quantiles of a continuous variable var1, for each level of a factor strata.

# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
                  )

# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE, labels = 1:q)
    quantile
}

I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:

qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))

This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable values up with the correct rows in the data frame).

A second approach was to do this in steps:

tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3

Again, this calculates the quantiles correctly for each factor level but, as with aggregate, the results aren't in the same order as the rows of the data frame. We can check this by putting the quantile "bins" into the data frame.

# get quantile bins
qfun2 <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q), 
        include.lowest = TRUE)
    quantile
}

tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33

Many of the values of var1 fall outside the bins recorded in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.

Chris asked Mar 22 '13 03:03



1 Answer

I think your issue is that you don't really want aggregate here. Use ave instead (or data.table or plyr):

qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))
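As a quick sanity check that ave puts each value back in its original row, here is a minimal sketch. The data frame chk and its interleaved strata are made up for illustration (they are not the question's data), and qfun is copied from the question. Note that because var1 is numeric, ave stores the factor's integer codes, so qq comes back as a numeric vector with values 1 to 5 rather than as a factor.

```r
# Hypothetical check data: 5 strata with 10 rows each, interleaved (A, B, C, D, E, A, ...)
# so that the rows are deliberately NOT sorted by stratum.
set.seed(1)
chk <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(rep(LETTERS[1:5], times = 10)))

# qfun as defined in the question
qfun <- function(x, q = 5) {
    cut(x, breaks = quantile(x, probs = 0:q/q),
        include.lowest = TRUE, labels = 1:q)
}

# ave() applies qfun within each stratum and writes the results back row by row
chk$qq <- ave(chk$var1, chk$strata, FUN = qfun)

# For every stratum, the stored values should match qfun() run on that stratum alone
aligned <- sapply(levels(chk$strata), function(s) {
    rows <- chk$strata == s
    all(chk$qq[rows] == as.integer(qfun(chk$var1[rows])))
})
all(aligned)

# Rows per quintile within each stratum
table(chk$qq, chk$strata)
```

If you need a factor rather than numeric codes, wrapping the result in factor() afterwards should do it.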

#using plyr
library(plyr)

qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))

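#using data.table (my preference)
library(data.table)

dat <- as.data.table(dat)  # := only works on a data.table, not on a plain data.frame
dat[, qq := qfun(var1), by = strata]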

aggregate usually implies returning an object that is smaller than the original. (In this case you were getting a data.frame in which x was a list with one element for each stratum.)
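The split/lapply attempt in the question goes wrong for a related reason: unlist() glues the per-stratum pieces together in level order (all the A values, then all the B values, and so on), which is not the row order of the data frame. If you want to stay with that approach in base R, unsplit() is the inverse of split() and puts the pieces back where they came from. A minimal sketch, again using made-up data (ex is a hypothetical data frame, not the question's dat):

```r
# Hypothetical example data with interleaved strata, so row order differs from stratum order
set.seed(1)
ex <- data.frame(var1 = rnorm(50, 10, 3)^2,
                 strata = factor(rep(LETTERS[1:5], times = 10)))

# qfun as defined in the question
qfun <- function(x, q = 5) {
    cut(x, breaks = quantile(x, probs = 0:q/q),
        include.lowest = TRUE, labels = 1:q)
}

# One factor of quintile labels per stratum
pieces <- lapply(split(ex$var1, ex$strata), qfun)

# unlist() returns the values grouped by stratum, not in row order
wrong <- unlist(pieces)

# unsplit() reverses split(), restoring the original row order
right <- unsplit(pieces, ex$strata)

# Check: within each stratum, 'right' matches qfun() applied to that stratum alone
ok <- sapply(levels(ex$strata), function(s) {
    rows <- ex$strata == s
    all(as.integer(right[rows]) == as.integer(qfun(ex$var1[rows])))
})
all(ok)
```

Unlike ave, this keeps the result as a factor, so ex$quintiles <- right can be assigned directly.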

mnel answered Oct 14 '22 04:10
