I want to calculate the mean for each "Day" but for a portion of the day (Time=12-14). This code works for me but I have to enter each day as a new line of code, which will amount to hundreds of lines.
This seems like it should be simple to do. I've done this easily when the grouping variables are the same, but I don't know how to do it when I don't want to include all values for the day. Is there a better way to do this?
sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
Here's what the data looks like:
Day Time StomCond_Trunc
165 12 33.57189926
165 12.1 50.29437636
165 12.2 35.59876214
165 12.3 24.39879768
Subsetting in R is an indexing feature for accessing the elements of an object. It can be used to select and filter variables and observations. You can use brackets to select rows and columns from a data frame.
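As a minimal sketch of how bracket subsetting could answer the question above, the rows can be filtered to the 12-14 window with a range condition and then averaged per day with base R's aggregate(). The small `sap` data frame here is made-up sample data shaped like the question's, not the asker's real values.

```r
# Hypothetical sample data shaped like the question's `sap` data frame
sap <- data.frame(
  Day  = c(165, 165, 165, 165, 166, 166, 166),
  Time = c(11.5, 12, 12.1, 14, 12, 13.5, 15),
  StomCond_Trunc = c(10, 30, 50, 40, 20, 70, 99)
)

# Keep only rows in the 12-14 window using bracket subsetting
midday <- sap[sap$Time >= 12 & sap$Time <= 14, ]

# Mean of StomCond_Trunc for each Day within that window
day_means <- aggregate(StomCond_Trunc ~ Day, data = midday, FUN = mean)
print(day_means)
```

A range condition such as `Time >= 12 & Time <= 14` avoids listing every time value by hand, and grouping by Day replaces the one-line-per-day approach.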
To find the mean of multiple columns based on multiple grouping columns in an R data frame, we can use the summarise_at function from the dplyr package together with mean.
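For readers without dplyr installed, the same idea can be sketched in base R. Note that this swaps in aggregate() with a cbind() formula rather than summarise_at; the column names `Period`, `StomCond` and `Temp` and all values are hypothetical, chosen only for illustration.

```r
# Hypothetical data: two grouping columns (Day, Period) and two value columns
df <- data.frame(
  Day      = c(165, 165, 165, 165, 166, 166),
  Period   = c("midday", "midday", "evening", "evening", "midday", "midday"),
  StomCond = c(30, 50, 10, 20, 40, 60),
  Temp     = c(20, 22, 15, 17, 21, 25)
)

# cbind() on the left-hand side asks aggregate() to summarise both columns;
# the right-hand side lists the grouping columns
group_means <- aggregate(cbind(StomCond, Temp) ~ Day + Period,
                         data = df, FUN = mean)
print(group_means)
```

The result has one row per Day/Period combination that occurs in the data, with one mean column per summarised variable.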
The difference between subset() and sample() is that subset() selects the rows of a dataset that meet a given condition, while sample() randomly selects 'n' items from the dataset.
If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.
Example:
df <- data.frame(Day=1:1000000,Time=sample(1:14,1000000,replace=TRUE),StomCond_Trunc=rnorm(1000000)*20)
data.frame
>system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
user system elapsed
16.255 0.377 24.263
data.table
library(data.table)
dt <- data.table(df,key="Time")
>system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
user system elapsed
9.534 0.178 15.270
Update from Matthew. This timing has improved dramatically since originally answered due to a new optimization feature in data.table 1.8.2.
Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1:
df <- data.frame(Day=1:1000000,
Time=sample(1:14,1000000,replace=TRUE),
StomCond_Trunc=rnorm(1000000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 10.19 0.27 10.47
library(data.table)
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 0.31 0.00 0.31