R: speeding up "group by" operations

Question

I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr's ddply() function which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel but if this one step were faster I could greatly decrease the number of nodes I need.

Here's a reasonable simplification of what I am trying to do:

library(Hmisc)  # Set up some example data year <-    sample(1970:2008, 1e6, rep=T) state <-   sample(1:50, 1e6, rep=T) group1 <-  sample(1:6, 1e6, rep=T) group2 <-  sample(1:3, 1e6, rep=T) myFact <-  rnorm(100, 15, 1e6) weights <- rnorm(1e6) myDF <- data.frame(year, state, group1, group2, myFact, weights)  # this is the step I want to make faster system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),                      function(df) wtd.mean(df$myFact, weights=df$weights)                                  )            )

All tips or suggestions are appreciated!

hadley · Accepted Answer

Instead of the normal R data frame, you can use a immutable data frame which returns pointers to the original when you subset and can be much faster:

idf <- idata.frame(myDF) system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),    function(df) wtd.mean(df$myFact, weights=df$weights)))  #    user  system elapsed  # 18.032   0.416  19.250

If I was to write a plyr function customised exactly to this situation, I'd do something like this:

system.time({   ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)   data <- as.matrix(myDF[c("myFact", "weights")])   indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))    fun <- function(rows) {     weighted.mean(data[rows, 1], data[rows, 2])   }   values <- vapply(indices, fun, numeric(1))    labels <- myDF[match(seq_len(attr(ids, "n")), ids),      c("year", "state", "group1", "group2")]   aggregateDF <- cbind(labels, values) })  # user  system elapsed  # 2.04    0.29    2.33

It's so much faster because it avoids copying the data, only extracting the subset needed for each computation when it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.

R: speeding up "group by" operations

Tags:

performance

r

plyr

JD Long

1 Answers

hadley

Recent Activity

Donate For Us

R: speeding up "group by" operations

Tags:

performance

r

plyr

JD Long

1 Answers

hadley

Related questions

Recent Activity

Donate For Us