I would like to aggregate a data frame by time interval, applying a different function to each column. I think I almost have aggregate down, and have divided my data into intervals with the chron package, which was easy enough.
But I'm not sure how to process the subsets. All of the mapping functions (*apply, *ply) take one function. I was hoping for something that took a vector of functions to apply per column or per variable, but haven't found one. So I'm writing a function that takes my data frame subsets and gives me the mean for all variables except "time", which is the index, and "Runoff", which should be the sum.
I tried this:
aggregate(d., list(Time=trunc(d.$time, "00:10:00")), function (dat) with(dat,
list(Time=time[1], mean(Port.1), mean(Port.1.1), mean(Port.2), mean(Port.2.1),
mean(Port.3), mean(Port.3.1), mean(Port.4), mean(Port.4.1), Runoff=sum(Port.5))))
which would be ugly enough even if it didn't give me this error:
Error in eval(substitute(expr), data, enclos = parent.frame()) :
not that many frames on the stack
which tells me I'm really doing something wrong. From what I've seen of R I think there must be an elegant way to do this, but what is it?
dput:
d. <- structure(list(time = structure(c(15030.5520833333, 15030.5555555556,
15030.5590277778, 15030.5625, 15030.5659722222), format = structure(c("m/d/y",
"h:m:s"), .Names = c("dates", "times")), origin = structure(c(1,
1, 1970), .Names = c("month", "day", "year")), class = c("chron",
"dates", "times")), Port.1 = c(0.359747, 0.418139, 0.417459,
0.418139, 0.417459), Port.1.1 = c(1.3, 11.8, 11.9, 12, 12.1),
Port.2 = c(0.288837, 0.335544, 0.335544, 0.335544, 0.335544
), Port.2.1 = c(2.3, 13, 13.2, 13.3, 13.4), Port.3 = c(0.253942,
0.358257, 0.358257, 0.358257, 0.359002), Port.3.1 = c(2,
12.6, 12.7, 12.9, 13.1), Port.4 = c(0.352269, 0.410609, 0.410609,
0.410609, 0.410609), Port.4.1 = c(5.9, 17.5, 17.6, 17.7,
17.9), Port.5 = c(0L, 0L, 0L, 0L, 0L)), .Names = c("time",
"Port.1", "Port.1.1", "Port.2", "Port.2.1", "Port.3", "Port.3.1",
"Port.4", "Port.4.1", "Port.5"), row.names = c(NA, 5L), class = "data.frame")
There are a lot of things wrong with your approach. A general piece of advice is not to go straight for what you think the final statement should look like, but to build it up in increments; otherwise debugging (understanding and fixing errors) becomes quite hard.
For example, you could have started with:
aggregate(d., list(Time=trunc(d.$time, "00:10:00")), identity)
to notice that there is something wrong with your split variable. Apparently aggregate does not like working with this class of data. You can fix this problem by converting Time to numeric:
aggregate(d., list(Time=as.numeric(trunc(d.$time, "00:10:00"))), identity)
Then you can try
aggregate(d., list(Time=as.numeric(trunc(d.$time, "00:10:00"))), apply.fun)
where apply.fun is your user-defined function. This fails with a rather cryptic message, but running
aggregate(d., list(Time=as.numeric(trunc(d.$time, "00:10:00"))), print)
helps you realize that the FUN function inside aggregate is not called once for each data piece (and passed a data.frame); it is called once for each column of each data piece (and passed an unnamed vector), so there is no way you can get the result you want using aggregate.
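A minimal base R sketch of this behaviour, using a made-up toy data frame rather than d. (the names toy, grp and seen are illustrative and not from the question). It records the class of every object that aggregate hands to FUN:

```r
# Toy data standing in for d.; all names here are assumptions for illustration
toy <- data.frame(
  grp    = c(1, 1, 2, 2),
  Port.1 = c(0.1, 0.3, 0.5, 0.7),
  Port.5 = c(1L, 2L, 3L, 4L)
)

# Collect the class of each argument that aggregate() passes to FUN
seen <- character(0)
res <- aggregate(toy[c("Port.1", "Port.5")], list(grp = toy$grp), function(x) {
  seen <<- c(seen, class(x))  # x is one column of one group
  mean(x)
})

print(seen)  # one entry per column per group
print(res)
```

FUN only ever sees a plain vector here, never a data.frame, which is why a function that refers to columns such as Port.1 by name cannot work inside aggregate.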
Instead, you could use the ddply function from the plyr package. There, the function applied to each piece does receive a data.frame, so you can do something like this:
apply.fun <- function(dat) with(dat, data.frame(Time=time[1],
                                                mean(Port.1),
                                                mean(Port.1.1),
                                                mean(Port.2),
                                                mean(Port.2.1),
                                                mean(Port.3),
                                                mean(Port.3.1),
                                                mean(Port.4),
                                                mean(Port.4.1),
                                                Runoff=sum(Port.5)))
d.$Time <- as.numeric(trunc(d.$time, "00:10:00"))
library(plyr)
ddply(d., "Time", apply.fun)
# Time mean.Port.1. mean.Port.1.1. mean.Port.2. mean.Port.2.1.
# 1 15030.5520833 0.4061886 9.82 0.3262026 11.04
# mean.Port.3. mean.Port.3.1. mean.Port.4. mean.Port.4.1. Runoff
# 1 0.337543 10.66 0.398941 15.32 0
Edit: Following up on @roysc's question in the first comment below, you can do:
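apply.fun <- function(dat) {
  # take the mean of every column in this piece
  out <- as.data.frame(lapply(dat, mean))
  # label the piece with its first timestamp
  out$Time <- dat$time[1]
  # Runoff is a sum, not a mean
  out$Runoff <- sum(dat$Port.5)
  return(out)
}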
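The same idea can be pushed toward what the question asked for, one function per column. The following is only a sketch in base R on made-up data: summarise_piece, funs and toy are hypothetical names, not part of plyr or of the original code.

```r
# Hypothetical helper: apply one summary function per named column of a piece
summarise_piece <- function(dat, funs) {
  # funs is a named list; each name must be a column of dat
  as.data.frame(Map(function(f, col) f(dat[[col]]), funs, names(funs)))
}

# Toy data standing in for d. (assumed values, for illustration only)
toy <- data.frame(
  Time   = c(1, 1, 2, 2),
  Port.1 = c(0.1, 0.3, 0.5, 0.7),
  Port.5 = c(1L, 2L, 3L, 4L)
)

# Which function to use for which column
funs <- list(Port.1 = mean, Port.5 = sum)

# Split into pieces, summarise each piece, and bind the rows back together
pieces <- split(toy, toy$Time)
out <- do.call(rbind, lapply(pieces, summarise_piece, funs = funs))
out <- cbind(Time = as.numeric(names(pieces)), out)
rownames(out) <- NULL
print(out)
```

Since ddply forwards extra arguments to the function it applies, a helper like this could presumably also be used as ddply(d., "Time", summarise_piece, funs = funs), with funs naming the columns of d.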
Use by instead of aggregate.
If f is the same as your anonymous function, except that list within it is replaced with data.frame, so that f <- function(dat) with(dat, data.frame(...whatever...)), then:
d.by <- by(d., list(Time = trunc(d.$time, "00:10:00")), f)
d.rbind <- do.call("rbind", d.by) # bind rows together
# fix up row and column names
rownames(d.rbind) <- NULL
colnames(d.rbind) <- colnames(d.)
We could remove the last statement, which assigns column names, if f named all of its output columns itself rather than just Time.
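As a rough illustration of that last point, here is a self-contained base R sketch on made-up data (toy and grp are illustrative names, not from the question) in which f names every output column, so no colnames fix-up is needed:

```r
# Toy data standing in for d. (assumed values, for illustration only)
toy <- data.frame(
  grp    = c(1, 1, 2, 2),
  Port.1 = c(0.1, 0.3, 0.5, 0.7),
  Port.5 = c(1L, 2L, 3L, 4L)
)

# f receives a data.frame piece and names each output column itself
f <- function(dat) with(dat, data.frame(
  Time   = grp[1],
  Port.1 = mean(Port.1),
  Runoff = sum(Port.5)
))

toy.by <- by(toy, list(Time = toy$grp), f)  # one data.frame per group
toy.rbind <- do.call("rbind", toy.by)       # bind rows together
rownames(toy.rbind) <- NULL
print(toy.rbind)
```

Unlike aggregate, by passes each whole piece to f as a data.frame, which is what allows different columns to be summarised differently.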
How about this?
library(plyr)
ddply(d., .(time), colMeans)