Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Calculate means for subset of a group

I want to calculate the mean for each "Day" but for a portion of the day (Time=12-14). This code works for me but I have to enter each day as a new line of code, which will amount to hundreds of lines.

This seems like it should be simple to do. I've done this easily when the grouping variables are the same but dont know how to do it when I dont want to include all values for the day. Is there a better way to do this?

sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)

sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)

Here's what the data looks like:

Day Time    StomCond_Trunc
165 12      33.57189926
165 12.1    50.29437636
165 12.2    35.59876214
165 12.3    24.39879768
like image 332
steph Avatar asked Feb 18 '12 16:02

steph


People also ask

What does subset mean in R?

Subsetting in R is a useful indexing feature for accessing object elements. It can be used to select and filter variables and observations. You can use brackets to select rows and columns from your dataframe.

How do you find the mean of multiple columns in R?

To find the mean of multiple columns based on multiple grouping columns in R data frame, we can use summarise_at function with mean function.

What is the use of subset () and sample () function in R?

The difference between subset () function and sample () is that, subset () is used to select data from the dataset which meets certain condition, while sample () is used for randomly selecting data of size 'n' from the dataset.


1 Answers

If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.

Example:

Large(ish) dataset

df <- data.frame(Day=1:1000000,Time=sample(1:14,1000000,replace=T),StomCond_Trunc=rnorm(100000)*20)

Using aggregate on the data.frame

>system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
   user  system elapsed 
 16.255   0.377  24.263

Converting it to a data.table

 dt <- data.table(df,key="Time")

>system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
   user  system elapsed 
  9.534   0.178  15.270 

Update from Matthew. This timing has improved dramatically since originally answered due to a new optimization feature in data.table 1.8.2.

Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1 :

df <- data.frame(Day=1:1000000,
                 Time=sample(1:14,1000000,replace=T),
                 StomCond_Trunc=rnorm(100000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean)) 
#   user  system elapsed 
#  10.19    0.27   10.47

dt <- data.table(df,key="Time") 
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day]) 
#   user  system elapsed 
#   0.31    0.00    0.31 
like image 176
Maiasaura Avatar answered Sep 25 '22 02:09

Maiasaura