Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R How to Get the Average of One Variable based on Ranges of Another Variable?

If I have a series of observations with two variables X and Y, how can I get the average value of Y based on ranges of variable X?

So for example, with some data like:

df = data.frame(x=runif(50,1,100),y=runif(50,300,700))

How could I get the answer to "When X is 1-10 the average of y 332.4, when X is 11-20 the average of y is 632.3, etc...."

like image 440
Kyle Brandt Avatar asked Aug 30 '11 14:08

Kyle Brandt


4 Answers

Cut your x using cut and then use ddply in package plyr:

> df$xrange <- cut(df$x, breaks=seq(0, 100, 10))

library(plyr)
ddply(df, .(xrange), summarize, mean_y=mean(y))
     xrange   mean_y
1    (0,10] 490.7571
2   (10,20] 462.6347
3   (20,30] 507.5614
4   (30,40] 482.6004
5   (40,50] 510.3081
6   (50,60] 480.7927
7   (60,70] 507.8944
8   (70,80] 458.4668
9   (80,90] 501.9672
10 (90,100] 493.4844
like image 172
Andrie Avatar answered Nov 11 '22 17:11

Andrie


Use cut to form groups and tapply to summarise over them.

df$grp <- cut(df$x, seq(0, 100, 10))
with(df, tapply(y, grp, mean))

If you are a plyr fan you may prefer

library(plyr)
ddply(df, .(grp), summarise, m = mean(y))

For completeness, the aggregate version is

aggregate(y ~ grp, df, mean)
like image 43
Richie Cotton Avatar answered Nov 11 '22 18:11

Richie Cotton


One way is to use cut() to create a factor from the x variable, specifying breaks every ten units. Given that factor, you can then use by() or aggregate() or ... to summarise the data frame, or rather just column y:

R> set.seed(42); DF <- data.frame(x=runif(50,1,100), y=rnorm(50,30,70))
R> summary(DF)
       x               y         
 Min.   : 1.39   Min.   :-179.5  
 1st Qu.:40.66   1st Qu.: -19.4  
 Median :64.45   Median :  39.6  
 Mean   :60.29   Mean   :  25.9  
 3rd Qu.:90.10   3rd Qu.:  74.7  
 Max.   :98.90   Max.   : 140.3  
R> DF$cx <- cut(DF$x, breaks=seq(0,100,by=10))
R> ?by
R> by(DF, DF$cx, FUN=function(z) mean(z$y))
DF$cx: (0,10]
[1] 67.8747
--------------------------------------------- 
DF$cx: (10,20]
[1] 52.9104
--------------------------------------------- 
DF$cx: (20,30]
[1] -53.8961
--------------------------------------------- 
DF$cx: (30,40]
[1] 44.1992
--------------------------------------------- 
DF$cx: (40,50]
[1] 21.7404
--------------------------------------------- 
DF$cx: (50,60]
[1] 16.2122
--------------------------------------------- 
DF$cx: (60,70]
[1] -27.0338
--------------------------------------------- 
DF$cx: (70,80]
[1] 42.283
--------------------------------------------- 
DF$cx: (80,90]
[1] 40.8042
--------------------------------------------- 
DF$cx: (90,100]
[1] 38.8917
R> 

Or using ddply():

R> library(plyr)
R> ddply(DF, .(cx), function(z) mean(z$y))
         cx       V1
1    (0,10]  67.8747
2   (10,20]  52.9104
3   (20,30] -53.8961
4   (30,40]  44.1992
5   (40,50]  21.7404
6   (50,60]  16.2122
7   (60,70] -27.0338
8   (70,80]  42.2830
9   (80,90]  40.8042
10 (90,100]  38.8917
R> 
like image 45
Dirk Eddelbuettel Avatar answered Nov 11 '22 16:11

Dirk Eddelbuettel


I think your question is causing your answers to be too narrow. You ought to be thinking of regression methods to summarize the joint relationships of continuous variables. Plotting with scatterplots and fitting regression splines is going to do less violence to the underlying relationships than the piecewise analysis that you specified.

like image 29
IRTFM Avatar answered Nov 11 '22 17:11

IRTFM