Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr for rowwise quantiles

Tags:

r

dplyr

I have a df of strata, each of which has 1000 samples from a posterior distribution of the estimates from that stratum.

mydf <- as.data.frame(lapply(seq(1, 1000), rnorm, n=100))
colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='')

I want to add columns for a few quantiles of the distribution for each row. In classic R, I'd write this.

quants <- t(apply(mydf, 1, quantile, probs=c(.025, .5, .975)))
colnames(quants) <- c('s_lo', 's_med', 's_hi')
mydf <- cbind(mydf, quants)

I suspect there's a direct way to do this in dplyr (maybe rowwise?) but my attempts have failed. Ideas?

like image 350
wylbur Avatar asked Apr 18 '17 19:04

wylbur


1 Answers

dplyr is not optimized for row-based calculations like that. Though you can do this with rowwise(), I recommend against it: performance will be abysmal. Your best speed will likely be with something that expects a matrix, and can operate on the rows. I suggest apply.

Instead of dealing with a 100x1000 data.frame, for brevity I'll go with 5 columns:

set.seed(2)
mydf <- as.data.frame(lapply(seq(1, 5), rnorm, n=10))
colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='')

Converting to a matrix is only reasonable if all columns are of the same class. In this case, they are all numeric so we are safe. (If you have non-numeric columns in the dataframe, extract only the ones you need here and bind them back in later.)

mymtx <- as.matrix(mydf)
apply(mymtx, 1, quantile, c(0.1, 0.9))
#         [,1]     [,2]     [,3]     [,4]     [,5]       [,6]     [,7]     [,8]     [,9]    [,10]
# 10% 1.028912 1.430939 1.999521 0.305907 1.753824 0.03267599 1.934381 1.270504 2.995816 1.489634
# 90% 4.950067 3.807735 4.881554 6.123989 4.886388 5.55628806 4.207605 4.184460 4.406384 3.782134

One notable with using apply like this is that the result is in row-based form, perhaps transposed from what one would expect. Simply wrap it in t(...) and you'll see the columns you might expect.

This can be recombined with the original dataframe using cbind or similar function.

This can be done in a pipeline like so:

mydf %>%
  bind_cols(as.data.frame(t(apply(., 1, quantile, c(0.1, 0.9)))))
#            s1         s2        s3       s4       s5        10%      90%
# 1   0.1030855  2.4176508 5.0908192 4.738939 4.616414 1.02891157 4.950067
# 2   1.1848492  2.9817528 1.8000742 4.318960 3.040897 1.43093918 3.807735
# 3   2.5878453  1.6073046 4.5896382 5.076164 4.158295 1.99952092 4.881554
# 4  -0.1303757  0.9603310 4.9546516 3.715842 6.903547 0.30590700 6.123989
# 5   0.9197482  3.7822290 3.0049378 3.223325 5.622494 1.75382406 4.886388
# 6   1.1324203 -0.3110691 0.5482936 3.404340 6.990920 0.03267599 5.556288
# 7   1.7079547  2.8786046 3.4772373 2.274020 4.694516 1.93438093 4.207605
# 8   0.7603020  2.0358067 2.4034418 3.097416 4.909156 1.27050387 4.184460
# 9   2.9844739  3.0128287 3.7922033 3.440938 4.815839 2.99581584 4.406384
# 10  0.8612130  2.4322652 3.2896367 3.753487 3.801232 1.48963385 3.782134

I'll leave the column naming up to you.

like image 86
r2evans Avatar answered Sep 24 '22 06:09

r2evans