R How to Get the Average of One Variable based on Ranges of Another Variable?

Question

If I have a series of observations with two variables X and Y, how can I get the average value of Y based on ranges of variable X?

So for example, with some data like:

df = data.frame(x=runif(50,1,100),y=runif(50,300,700))

How could I get the answer to "When X is 1-10 the average of y 332.4, when X is 11-20 the average of y is 632.3, etc...."

Andrie · Accepted Answer

Cut your x using cut and then use ddply in package plyr:

> df$xrange <- cut(df$x, breaks=seq(0, 100, 10))

library(plyr)
ddply(df, .(xrange), summarize, mean_y=mean(y))
     xrange   mean_y
1    (0,10] 490.7571
2   (10,20] 462.6347
3   (20,30] 507.5614
4   (30,40] 482.6004
5   (40,50] 510.3081
6   (50,60] 480.7927
7   (60,70] 507.8944
8   (70,80] 458.4668
9   (80,90] 501.9672
10 (90,100] 493.4844

Richie Cotton · Answer

Use cut to form groups and tapply to summarise over them.

df$grp <- cut(df$x, seq(0, 100, 10))
with(df, tapply(y, grp, mean))

If you are a plyr fan you may prefer

library(plyr)
ddply(df, .(grp), summarise, m = mean(y))

For completeness, the aggregate version is

aggregate(y ~ grp, df, mean)

Dirk Eddelbuettel · Answer

One way is to use cut() to create a factor from the x variable, specifying breaks every ten units. Given that factor, you can then use by() or aggregate() or ... to summarise the data frame, or rather just column y:

R> set.seed(42); DF <- data.frame(x=runif(50,1,100), y=rnorm(50,30,70))
R> summary(DF)
       x               y         
 Min.   : 1.39   Min.   :-179.5  
 1st Qu.:40.66   1st Qu.: -19.4  
 Median :64.45   Median :  39.6  
 Mean   :60.29   Mean   :  25.9  
 3rd Qu.:90.10   3rd Qu.:  74.7  
 Max.   :98.90   Max.   : 140.3  
R> DF$cx <- cut(DF$x, breaks=seq(0,100,by=10))
R> ?by
R> by(DF, DF$cx, FUN=function(z) mean(z$y))
DF$cx: (0,10]
[1] 67.8747
--------------------------------------------- 
DF$cx: (10,20]
[1] 52.9104
--------------------------------------------- 
DF$cx: (20,30]
[1] -53.8961
--------------------------------------------- 
DF$cx: (30,40]
[1] 44.1992
--------------------------------------------- 
DF$cx: (40,50]
[1] 21.7404
--------------------------------------------- 
DF$cx: (50,60]
[1] 16.2122
--------------------------------------------- 
DF$cx: (60,70]
[1] -27.0338
--------------------------------------------- 
DF$cx: (70,80]
[1] 42.283
--------------------------------------------- 
DF$cx: (80,90]
[1] 40.8042
--------------------------------------------- 
DF$cx: (90,100]
[1] 38.8917
R>

Or using ddply():

R> library(plyr)
R> ddply(DF, .(cx), function(z) mean(z$y))
         cx       V1
1    (0,10]  67.8747
2   (10,20]  52.9104
3   (20,30] -53.8961
4   (30,40]  44.1992
5   (40,50]  21.7404
6   (50,60]  16.2122
7   (60,70] -27.0338
8   (70,80]  42.2830
9   (80,90]  40.8042
10 (90,100]  38.8917
R>

IRTFM · Answer

I think your question is causing your answers to be too narrow. You ought to be thinking of regression methods to summarize the joint relationships of continuous variables. Plotting with scatterplots and fitting regression splines is going to do less violence to the underlying relationships than the piecewise analysis that you specified.

R How to Get the Average of One Variable based on Ranges of Another Variable?

Tags:

dataframe

r

aggregate

Kyle Brandt

4 Answers

Andrie

Richie Cotton

Dirk Eddelbuettel

IRTFM

Recent Activity

Donate For Us

R How to Get the Average of One Variable based on Ranges of Another Variable?

Tags:

dataframe

r

aggregate

Kyle Brandt

4 Answers

Andrie

Richie Cotton

Dirk Eddelbuettel

IRTFM

Related questions

Recent Activity

Donate For Us