How to speed up summarise and ddply?

Question

I have a data frame with 2 million rows and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors) and get the weighted mean of 3 other columns, with weights defined by my data set. Computing a plain, unweighted mean with aggregate is reasonably quick:

system.time(a2 <- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean))
   user  system elapsed 
 91.358   4.747 115.727

The problem is that I want to use weighted.mean instead of mean to calculate my aggregate columns.

If I try the following ddply on the same data frame (note that I convert it to an immutable data frame with idata.frame), it does not finish after 20 minutes:

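x <- ddply(idata.frame(aggdf),
       c("fac1","fac2","fac3"),
       summarise,
       col1=weighted.mean(col1, w),
       col2=weighted.mean(col2, w),
       col3=weighted.mean(col3, w),
       # Sum w last: plyr's summarise updates existing names immediately,
       # so an earlier w=sum(w) could replace the per-row weights used above.
       w=sum(w))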

This operation seems to be CPU-hungry but not very RAM-intensive.

EDIT: I ended up writing this little function, which "cheats" a bit by exploiting the fact that a weighted mean is sum(w * x) / sum(w). It does the multiplication and the division on the whole object rather than on the slices, so only a plain sum has to be computed per group.

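weighted_mean_cols <- function(df, bycols, aggcols, weightcol) {
    # Pre-multiply each value column by the row weights (one vectorised pass).
    df[aggcols] <- df[aggcols] * df[[weightcol]]
    # Per group, sum the weights and the weighted values.
    # df[bycols] stays a list even when bycols names a single column.
    df <- aggregate(df[c(weightcol, aggcols)], by = df[bycols], FUN = sum)
    # Weighted mean = sum(w * x) / sum(w).
    df[aggcols] <- df[aggcols] / df[[weightcol]]
    df
}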

When I run as:

a2 <- weighted_mean_cols(aggdf, c("fac1","fac2","fac3"), c("col1","col2","col3"),"w")

I get good performance, and somewhat reusable, elegant code.
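As a quick sanity check of the identity the function relies on, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical and only serve to show that the "multiply, sum, divide" route gives the same numbers as calling weighted.mean on each slice.

```r
# Toy data: the name `toy` and all values are made up for illustration.
toy <- data.frame(
  fac1 = factor(c("a", "a", "b", "b", "b")),
  col1 = c(1, 3, 2, 4, 6),
  w    = c(1, 3, 1, 1, 2)
)

# Route 1: call weighted.mean on each slice (what the ddply version does).
by_slice <- sapply(split(toy, toy$fac1),
                   function(d) weighted.mean(d$col1, d$w))

# Route 2: multiply once over the whole column, sum per group, divide once.
num <- tapply(toy$col1 * toy$w, toy$fac1, sum)
den <- tapply(toy$w, toy$fac1, sum)
by_sums <- num / den

# Both routes should agree up to floating-point tolerance.
same <- isTRUE(all.equal(unname(by_slice), as.vector(by_sums)))
```

Only the per-group sums depend on the grouping, which is why the expensive per-slice function calls can be avoided.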

crayola · Accepted Answer

Though ddply is hard to beat for elegance and ease of code, I find that for big data, tapply is much faster. In your case, I would use something like:

do.call("cbind", list((w <- tapply(..)), tapply(..)))

Sorry for the dots and possibly faulty understanding of the question; but I am in a bit of a rush and must catch a bus in about minus five minutes!
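The tapply calls above are elided in the answer, so the following is only one possible reading of the idea, not the answerer's actual code. It assumes the grouping factors are combined into a single key with interaction() and that the weighted mean is computed as a ratio of per-group sums. The small `aggdf` built here is a hypothetical stand-in with two factors and two value columns.

```r
# Hypothetical stand-in for aggdf, kept small so the example runs instantly.
set.seed(1)
n <- 20
aggdf <- data.frame(
  fac1 = factor(sample(c("x", "y"), n, replace = TRUE)),
  fac2 = factor(sample(c("p", "q"), n, replace = TRUE)),
  col1 = runif(n),
  col2 = runif(n),
  w    = runif(n)
)

# One grouping key per combination of the factors;
# drop = TRUE keeps only the combinations that occur in the data.
grp <- interaction(aggdf$fac1, aggdf$fac2, drop = TRUE)

# Sum of weights per group, reused as the denominator below.
w_sum <- tapply(aggdf$w, grp, sum)

# Bind the per-group results into one matrix, one row per group.
res <- do.call("cbind", list(
  w    = w_sum,
  col1 = tapply(aggdf$col1 * aggdf$w, grp, sum) / w_sum,
  col2 = tapply(aggdf$col2 * aggdf$w, grp, sum) / w_sum
))
```

The result is a numeric matrix with the group key in the row names, so the factor columns would have to be recovered from those names if they are needed separately.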

Tags: r, plyr

evanrsparks
