Calculating statistics on subsets of data [duplicate]

Question

Here is a small reproducible example of my data:

> mydata <- structure(list(subject = c(1, 1, 1, 2, 2, 2), time = c(0, 1, 2, 0, 1, 2), measure = c(10, 12, 8, 7, 0, 0)), .Names = c("subject", "time", "measure"), row.names = c(NA, -6L), class = "data.frame")

> mydata

subject  time  measure
1          0      10
1          1      12
1          2       8
2          0       7
2          1       0
2          2       0

I would like to generate a new variable containing the mean of measure for that particular subject, so:

subject  time  measure  mn_measure
1          0      10      10
1          1      12      10
1          2       8      10
2          0       7      2.333
2          1       0      2.333
2          2       0      2.333

Is there an easy way to do this, other than looping through all the records programmatically or reshaping to wide format first?

Andrie · Accepted Answer

Use the base R function ave(), which, despite its confusing name, can calculate a variety of statistics, including the mean:

within(mydata, mean<-ave(measure, subject, FUN=mean))

  subject time measure      mean
1       1    0      10 10.000000
2       1    1      12 10.000000
3       1    2       8 10.000000
4       2    0       7  2.333333
5       2    1       0  2.333333
6       2    2       0  2.333333

Note that I use within() only to keep the code short. Here is the equivalent without within():

mydata$mean <- ave(mydata$measure, mydata$subject, FUN=mean)
mydata
  subject time measure      mean
1       1    0      10 10.000000
2       1    1      12 10.000000
3       1    2       8 10.000000
4       2    0       7  2.333333
5       2    1       0  2.333333
6       2    2       0  2.333333
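
As a sketch of the "variety of statistics" point: ave() applies whatever function is passed as FUN within each group and returns a vector as long as its input, so other per-subject summaries should work the same way. The column names med_measure and max_measure below are illustrative and not part of the original answer.

```r
# Rebuild the example data from the question
mydata <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  time    = c(0, 1, 2, 0, 1, 2),
  measure = c(10, 12, 8, 7, 0, 0)
)

# Per-subject median and maximum, repeated on every row of that subject
mydata$med_measure <- ave(mydata$measure, mydata$subject, FUN = median)
mydata$max_measure <- ave(mydata$measure, mydata$subject, FUN = max)

mydata
```

FUN has to be passed by name, because ave() treats any unnamed arguments after the first as additional grouping variables.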

Arun · Answer

Alternatively, with the data.table package:

library(data.table)
dt <- data.table(mydata, key = "subject")
dt[, mn_measure := mean(measure), by = subject]

#   subject time measure mn_measure
# 1:       1    0      10  10.000000
# 2:       1    1      12  10.000000
# 3:       1    2       8  10.000000
# 4:       2    0       7   2.333333
# 5:       2    1       0   2.333333
# 6:       2    2       0   2.333333
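
The same grouped assignment can add more than one column at a time. The following is a minimal sketch, assuming the data.table package is installed; sd_measure is an illustrative extra column that is not in the original answer.

```r
library(data.table)

# Rebuild the example data from the question
mydata <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  time    = c(0, 1, 2, 0, 1, 2),
  measure = c(10, 12, 8, 7, 0, 0)
)
dt <- as.data.table(mydata)

# := adds the columns to dt by reference, so no reassignment is needed;
# by = subject evaluates mean() and sd() separately for each subject
dt[, c("mn_measure", "sd_measure") := list(mean(measure), sd(measure)),
   by = subject]

# A := call does not print at the console; a trailing [] shows the table
dt[]
```

Because := modifies dt in place, the original data frame mydata is left unchanged.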

Paul Hiemstra · Answer

You can use ddply from the plyr package:

library(plyr)
res <- ddply(mydata, .(subject), mutate, mn_measure = mean(measure))
res
  subject time measure mn_measure
1       1    0      10  10.000000
2       1    1      12  10.000000
3       1    2       8  10.000000
4       2    0       7   2.333333
5       2    1       0   2.333333
6       2    2       0   2.333333
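
The choice of mutate matters here. As a rough illustration, assuming the plyr package is installed, the sketch below contrasts it with summarise, which collapses each group to one row instead of keeping every record. The object names per_row and per_subject are illustrative.

```r
library(plyr)

# Rebuild the example data from the question
mydata <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  time    = c(0, 1, 2, 0, 1, 2),
  measure = c(10, 12, 8, 7, 0, 0)
)

# mutate keeps every row and adds the group mean as a new column
per_row <- ddply(mydata, .(subject), mutate, mn_measure = mean(measure))

# summarise returns a single row per subject
per_subject <- ddply(mydata, .(subject), summarise, mn_measure = mean(measure))

nrow(per_row)      # 6
nrow(per_subject)  # 2
```

For the output requested in the question, where the mean is repeated on each record, mutate is the variant to use.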

Tags: dataframe, r

LeelaSella
