 

R: speeding up "group by" operations

I have a simulation with a huge aggregate-and-combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel, but if this one step were faster I could greatly decrease the number of nodes I need.

Here's a reasonable simplification of what I am trying to do:

library(Hmisc)
library(plyr)   # ddply() comes from plyr, not Hmisc

# Set up some example data
year <-    sample(1970:2008, 1e6, rep=T)
state <-   sample(1:50, 1e6, rep=T)
group1 <-  sample(1:6, 1e6, rep=T)
group2 <-  sample(1:3, 1e6, rep=T)
myFact <-  rnorm(100, 15, 1e6)
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)

# this is the step I want to make faster
system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                     function(df) wtd.mean(df$myFact, weights=df$weights)
                                 )
           )

All tips or suggestions are appreciated!

asked Sep 10 '10 by JD Long

1 Answer

Instead of a normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset and can be much faster:

idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
   function(df) wtd.mean(df$myFact, weights=df$weights)))

#    user  system elapsed
#  18.032   0.416  19.250

If I were to write a plyr function customised exactly to this situation, I'd do something like this:

system.time({
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  data <- as.matrix(myDF[c("myFact", "weights")])
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))

  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  labels <- myDF[match(seq_len(attr(ids, "n")), ids),
     c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})

#   user  system elapsed
#  2.04    0.29    2.33
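As a quick sanity check (a sketch on small toy data, not the myDF above), the index-into-a-matrix logic can be compared against a direct split-and-compute version; the two should agree exactly:

set.seed(1)
df <- data.frame(
  g = sample(1:3, 100, replace = TRUE),
  x = rnorm(100),
  w = runif(100)
)

# direct version: weighted mean of x within each level of g
direct <- sapply(split(df, df$g), function(d) weighted.mean(d$x, d$w))

# matrix-index version, mirroring the approach above:
# precompute row indices per group, then subset a matrix
rows_by_group <- split(seq_len(nrow(df)), df$g)
m <- as.matrix(df[c("x", "w")])
fast <- vapply(rows_by_group,
               function(rows) weighted.mean(m[rows, 1], m[rows, 2]),
               numeric(1))

all.equal(direct, fast)  # TRUE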

It's so much faster because it avoids copying the data, extracting only the subset needed for each computation at the moment it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.
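The matrix-subsetting point is easy to see in isolation. This illustrative micro-benchmark (sizes and iteration counts are arbitrary; absolute timings will vary by machine) times repeated row extraction from a data frame against the same data stored as a matrix:

set.seed(42)
df <- data.frame(a = rnorm(1e5), b = rnorm(1e5))
m  <- as.matrix(df)
rows <- sample(1e5, 100)

# repeatedly extract the same 100 rows from each representation
t_df  <- system.time(for (i in 1:2000) df[rows, ])["elapsed"]
t_mat <- system.time(for (i in 1:2000) m[rows, ])["elapsed"]

c(data.frame = t_df, matrix = t_mat)  # matrix subsetting is typically much faster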

answered Sep 28 '22 by hadley