R: aggregate with column-specific function

Tags: r, aggregate

I would like to aggregate a data frame by time interval, applying a different function to each column. I think I almost have aggregate down, and have divided my data into intervals with the chron package, which was easy enough.

But I'm not sure how to process the subsets. The mapping functions (*apply, *ply) all take a single function. I was hoping for something that takes a vector of functions to apply per column or per variable, but haven't found one. So I'm writing a function that takes my data frame subsets and gives me the mean of every variable except "time", which is the index, and "Runoff", which should be the sum.

I tried this:

aggregate(d., list(Time=trunc(d.$time, "00:10:00")), function (dat) with(dat, 
list(Time=time[1], mean(Port.1), mean(Port.1.1), mean(Port.2), mean(Port.2.1), 
mean(Port.3), mean(Port.3.1), mean(Port.4), mean(Port.4.1), Runoff=sum(Port.5))))

which would be ugly enough even if it didn't give me this error:

Error in eval(substitute(expr), data, enclos = parent.frame()) : 
  not that many frames on the stack

which tells me I'm really doing something wrong. From what I've seen of R I think there must be an elegant way to do this, but what is it?

dput:

d. <- structure(list(time = structure(c(15030.5520833333, 15030.5555555556, 
15030.5590277778, 15030.5625, 15030.5659722222), format = structure(c("m/d/y", 
"h:m:s"), .Names = c("dates", "times")), origin = structure(c(1, 
1, 1970), .Names = c("month", "day", "year")), class = c("chron", 
"dates", "times")), Port.1 = c(0.359747, 0.418139, 0.417459, 
0.418139, 0.417459), Port.1.1 = c(1.3, 11.8, 11.9, 12, 12.1), 
    Port.2 = c(0.288837, 0.335544, 0.335544, 0.335544, 0.335544
    ), Port.2.1 = c(2.3, 13, 13.2, 13.3, 13.4), Port.3 = c(0.253942, 
    0.358257, 0.358257, 0.358257, 0.359002), Port.3.1 = c(2, 
    12.6, 12.7, 12.9, 13.1), Port.4 = c(0.352269, 0.410609, 0.410609, 
    0.410609, 0.410609), Port.4.1 = c(5.9, 17.5, 17.6, 17.7, 
    17.9), Port.5 = c(0L, 0L, 0L, 0L, 0L)), .Names = c("time", 
"Port.1", "Port.1.1", "Port.2", "Port.2.1", "Port.3", "Port.3.1", 
"Port.4", "Port.4.1", "Port.5"), row.names = c(NA, 5L), class = "data.frame")
asked Jul 22 '12 16:07 by scry


3 Answers

There are a lot of things wrong with your approach. A general piece of advice is not to go straight for what you think the final statement should look like, but to build it up in increments; otherwise debugging (understanding and fixing errors) becomes quite hard.

For example, you could have started with:

aggregate(d., list(Time=trunc(d.$time, "00:10:00")), identity)

to notice that there is something wrong with your split variable. Apparently aggregate does not like a grouping variable of this class (chron). You can fix this problem by converting Time to numeric:

aggregate(d., list(Time=as.numeric(trunc(d.$time, "00:10:00"))), identity)

Then you can try

aggregate(d., list(Time=as.numeric(trunc(d.$time, "00:10:00"))), apply.fun)

where apply.fun is your user-defined function. This fails with a rather cryptic message, but running

aggregate(d., list(Time=as.numeric(trunc(d.$time, "00:10:00"))), print)

helps you realize that the FUN function inside aggregate is not called once for each data piece (and passed a data.frame); it is called once for each column of each data piece (and passed an unnamed vector). So there is no way to get the result you want using aggregate.
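The per-column behaviour can be checked without chron. Below is a minimal sketch in base R; the data frame df, its columns, and the seen list are made up for illustration and are not part of the original data.

```r
# Toy data, invented for this illustration (not the asker's data)
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# Record every argument that aggregate() hands to FUN
seen <- list()
res <- aggregate(df[c("x", "y")], by = list(g = df$g), FUN = function(v) {
  seen[[length(seen) + 1]] <<- v  # store the piece FUN received
  sum(v)
})

length(seen)              # 4: one call per column per group (2 columns x 2 groups)
is.data.frame(seen[[1]])  # FALSE: FUN gets a plain vector, never a data.frame
res                       # one row per group, each column summarised by sum()
```

Because FUN only ever sees one column at a time, it cannot tell which column it is working on, which is why a single FUN cannot mix mean and sum here.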

Instead, you could use the ddply function from the plyr package. There, the function applied to each piece does receive a data.frame so you can do something like this:

apply.fun <- function(dat) with(dat, data.frame(Time=time[1],
                                                mean(Port.1),
                                                mean(Port.1.1),
                                                mean(Port.2),
                                                mean(Port.2.1),
                                                mean(Port.3),
                                                mean(Port.3.1),
                                                mean(Port.4),
                                                mean(Port.4.1),
                                                Runoff=sum(Port.5)))

d.$Time <- as.numeric(trunc(d.$time, "00:10:00"))
library(plyr)
ddply(d., "Time", apply.fun)

#            Time mean.Port.1. mean.Port.1.1. mean.Port.2. mean.Port.2.1.
# 1 15030.5520833    0.4061886           9.82    0.3262026          11.04
#   mean.Port.3. mean.Port.3.1. mean.Port.4. mean.Port.4.1. Runoff
# 1     0.337543          10.66     0.398941          15.32      0

Edit: Following up on @roysc's question in the first comment below, you can do:

apply.fun <- function(dat) {
  out <- as.data.frame(lapply(dat, mean))
  out$Time <- dat$time[1]
  out$Runoff <- sum(dat$Port.5)
  return(out)
}
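For the "vector of functions, one per column" idea from the question, a base R sketch along the same lines might look like the following. The names df, grp, and funs are hypothetical, the data is invented, and this is only one possible way to do it, not something plyr or aggregate provides directly.

```r
# Toy data, invented for this illustration
df <- data.frame(
  grp    = c(1, 1, 2),
  Port.1 = c(0.2, 0.4, 0.6),
  Port.5 = c(1L, 2L, 5L)
)

# One summary function per column, keyed by column name
funs <- list(Port.1 = mean, Port.5 = sum)

# Split into pieces, then apply each column's own function within each piece
pieces <- split(df, df$grp)
rows <- lapply(pieces, function(piece) {
  vals <- Map(function(f, col) f(piece[[col]]), funs, names(funs))
  data.frame(grp = piece$grp[1], vals)
})

out <- do.call(rbind, rows)
rownames(out) <- NULL
out  # one row per group: mean of Port.1, sum of Port.5
```

Keeping the functions in a named list means a new column only needs one more entry in funs, rather than another hand-written mean(...) call.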
answered Oct 20 '22 16:10 by flodel


Use by instead of aggregate.

If f is the same as your anonymous function, except that list inside it is replaced with data.frame, so that f <- function(dat) with(dat, data.frame(...whatever...)), then:

d.by <- by(d., list(Time = trunc(d.$time, "00:10:00")), f)
d.rbind <- do.call("rbind", d.by) # bind rows together

# fix up row and column names
rownames(d.rbind) <- NULL
colnames(d.rbind) <- colnames(d.)

We could drop the last statement, which assigns the column names, if f named every column itself rather than only Time and Runoff.
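As a rough sketch of that variant, here is by() with an f that names every output column, so no colnames fix-up is needed. The data frame df and its columns grp, a, and runoff are invented for illustration and use a plain numeric grouping column instead of chron.

```r
# Toy data, invented for this illustration
df <- data.frame(grp = c(1, 1, 2), a = c(1, 3, 5), runoff = c(1L, 2L, 5L))

# f receives a whole data.frame piece and names each result column itself
f <- function(dat) with(dat, data.frame(grp = grp[1],
                                        a = mean(a),
                                        Runoff = sum(runoff)))

pieces <- by(df, list(grp = df$grp), f)
out <- do.call("rbind", pieces)  # bind the one-row results together
rownames(out) <- NULL
out  # columns grp, a, Runoff; one row per group
```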

answered Oct 20 '22 17:10 by G. Grothendieck


How about this?

library(plyr)
ddply(d., .(time), colMeans)
answered Oct 20 '22 15:10 by Maiasaura