Calculating statistics on subsets of data [duplicate]

Tags: dataframe, r

Here is a small reproducible example of my data:

> mydata <- structure(list(subject = c(1, 1, 1, 2, 2, 2), time = c(0, 1, 2, 0, 1, 2), measure = c(10, 12, 8, 7, 0, 0)), .Names = c("subject", "time", "measure"), row.names = c(NA, -6L), class = "data.frame")

> mydata

subject  time  measure
1          0      10
1          1      12
1          2       8
2          0       7
2          1       0
2          2       0

I would like to generate a new variable containing the mean of measure for that particular subject, so:

subject  time  measure  mn_measure
1          0      10      10
1          1      12      10
1          2       8      10
2          0       7      2.333
2          1       0      2.333
2          2       0      2.333

Is there an easy way to do this, other than looping through all the records programmatically or reshaping to wide format first?

Asked by LeelaSella on Feb 11 '13 at 12:02


3 Answers

Use the base R function ave(), which, despite its confusing name, can calculate a variety of statistics, including the mean:

within(mydata, mean<-ave(measure, subject, FUN=mean))

  subject time measure      mean
1       1    0      10 10.000000
2       1    1      12 10.000000
3       1    2       8 10.000000
4       2    0       7  2.333333
5       2    1       0  2.333333
6       2    2       0  2.333333

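Note that I use within() just for the sake of shorter code; it returns a modified copy and leaves mydata itself unchanged. Here is the equivalent without within(), which adds the column to mydata directly: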

mydata$mean <- ave(mydata$measure, mydata$subject, FUN=mean)
mydata
  subject time measure      mean
1       1    0      10 10.000000
2       1    1      12 10.000000
3       1    2       8 10.000000
4       2    0       7  2.333333
5       2    1       0  2.333333
6       2    2       0  2.333333
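
Since ave() accepts any function through FUN, the same pattern extends to other per-subject statistics. The following is a minimal sketch, not part of the original answer; the column names max_measure, sd_measure and centered are made up for illustration.

```r
# Rebuild the example data from the question.
mydata <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  time    = c(0, 1, 2, 0, 1, 2),
  measure = c(10, 12, 8, 7, 0, 0)
)

# ave() applies FUN within each subject and returns a vector as long as
# its input, so the result can be assigned directly as a new column.
# (Column names below are hypothetical, chosen for this sketch.)
mydata$max_measure <- ave(mydata$measure, mydata$subject, FUN = max)
mydata$sd_measure  <- ave(mydata$measure, mydata$subject, FUN = sd)

# FUN may also return one value per row, e.g. the deviation of each
# measurement from its subject's mean.
mydata$centered <- ave(mydata$measure, mydata$subject,
                       FUN = function(x) x - mean(x))

mydata
```

When FUN returns a single value per group, ave() repeats it for every row of that group; when it returns one value per row, the values are placed back in the original row order.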
Answered by Andrie on Oct 02 '22 at 18:10


Alternatively, with the data.table package:

library(data.table)
dt <- data.table(mydata, key = "subject")
dt[, mn_measure := mean(measure), by = subject]

#   subject time measure mn_measure
# 1:       1    0      10  10.000000
# 2:       1    1      12  10.000000
# 3:       1    2       8  10.000000
# 4:       2    0       7   2.333333
# 5:       2    1       0   2.333333
# 6:       2    2       0   2.333333
Answered by Arun on Oct 02 '22 at 19:10


You can use ddply from the plyr package:

library(plyr)
res = ddply(mydata, .(subject), mutate, mn_measure = mean(measure))
res
  subject time measure mn_measure
1       1    0      10  10.000000
2       1    1      12  10.000000
3       1    2       8  10.000000
4       2    0       7   2.333333
5       2    1       0   2.333333
6       2    2       0   2.333333
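
For readers without plyr installed, the split-apply-combine idea behind this call can be sketched in base R. This is only a conceptual illustration under that assumption, not plyr's actual implementation; the helper names pieces and d are hypothetical.

```r
# Rebuild the example data from the question.
mydata <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  time    = c(0, 1, 2, 0, 1, 2),
  measure = c(10, 12, 8, 7, 0, 0)
)

# Split: one data frame per subject.
pieces <- split(mydata, mydata$subject)

# Apply: add the subject mean as a new column in each piece.
pieces <- lapply(pieces, function(d) {
  d$mn_measure <- mean(d$measure)
  d
})

# Combine: stack the pieces back into a single data frame
# and drop the row names that rbind() derives from the list names.
res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```

The result has the same columns as the ddply output above, with rows ordered by subject.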
Answered by Paul Hiemstra on Oct 02 '22 at 18:10