If I have a series of observations with two variables X and Y, how can I get the average value of Y based on ranges of variable X?
So for example, with some data like:
df = data.frame(x=runif(50,1,100),y=runif(50,300,700))
How could I get the answer to "When X is 1-10 the average of y 332.4, when X is 11-20 the average of y is 632.3, etc...."
Cut your x using cut
and then use ddply
in package plyr
:
> df$xrange <- cut(df$x, breaks=seq(0, 100, 10))
library(plyr)
ddply(df, .(xrange), summarize, mean_y=mean(y))
xrange mean_y
1 (0,10] 490.7571
2 (10,20] 462.6347
3 (20,30] 507.5614
4 (30,40] 482.6004
5 (40,50] 510.3081
6 (50,60] 480.7927
7 (60,70] 507.8944
8 (70,80] 458.4668
9 (80,90] 501.9672
10 (90,100] 493.4844
Use cut
to form groups and tapply
to summarise over them.
df$grp <- cut(df$x, seq(0, 100, 10))
with(df, tapply(y, grp, mean))
If you are a plyr
fan you may prefer
library(plyr)
ddply(df, .(grp), summarise, m = mean(y))
For completeness, the aggregate
version is
aggregate(y ~ grp, df, mean)
One way is to use cut()
to create a factor from the x
variable, specifying breaks every ten units. Given that factor, you can then use by()
or aggregate()
or ... to summarise the data frame, or rather just column y
:
R> set.seed(42); DF <- data.frame(x=runif(50,1,100), y=rnorm(50,30,70))
R> summary(DF)
x y
Min. : 1.39 Min. :-179.5
1st Qu.:40.66 1st Qu.: -19.4
Median :64.45 Median : 39.6
Mean :60.29 Mean : 25.9
3rd Qu.:90.10 3rd Qu.: 74.7
Max. :98.90 Max. : 140.3
R> DF$cx <- cut(DF$x, breaks=seq(0,100,by=10))
R> ?by
R> by(DF, DF$cx, FUN=function(z) mean(z$y))
DF$cx: (0,10]
[1] 67.8747
---------------------------------------------
DF$cx: (10,20]
[1] 52.9104
---------------------------------------------
DF$cx: (20,30]
[1] -53.8961
---------------------------------------------
DF$cx: (30,40]
[1] 44.1992
---------------------------------------------
DF$cx: (40,50]
[1] 21.7404
---------------------------------------------
DF$cx: (50,60]
[1] 16.2122
---------------------------------------------
DF$cx: (60,70]
[1] -27.0338
---------------------------------------------
DF$cx: (70,80]
[1] 42.283
---------------------------------------------
DF$cx: (80,90]
[1] 40.8042
---------------------------------------------
DF$cx: (90,100]
[1] 38.8917
R>
Or using ddply()
:
R> library(plyr)
R> ddply(DF, .(cx), function(z) mean(z$y))
cx V1
1 (0,10] 67.8747
2 (10,20] 52.9104
3 (20,30] -53.8961
4 (30,40] 44.1992
5 (40,50] 21.7404
6 (50,60] 16.2122
7 (60,70] -27.0338
8 (70,80] 42.2830
9 (80,90] 40.8042
10 (90,100] 38.8917
R>
I think your question is causing your answers to be too narrow. You ought to be thinking of regression methods to summarize the joint relationships of continuous variables. Plotting with scatterplots and fitting regression splines is going to do less violence to the underlying relationships than the piecewise analysis that you specified.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With