
Aggregate Function with Variable By List

Tags: r

I'm trying to create an R script to summarize measures in a data frame. I'd like it to react dynamically to changes in the structure of the data frame. For example, I have the following block:

library(plyr) #loading plyr just to access baseball data frame
MyData <- baseball[,cbind("id","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,"id"]), FUN=sum)

This block creates a data frame (AggHits) with the total hits (h) for each player (id). Yay.

Suppose I want to bring in the team. How do I change the by argument so that AggHits has the total hits for each combination of "id" and "team"? I tried the following, and the second line throws the error "arguments must have same length":

MyData <- baseball[,cbind("id","team","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind("id","team")]), FUN=sum)

More generally, I'd like to write the second line so that it automatically aggregates h by all variables except h. I can generate the list of variables to group by pretty easily using setdiff.

# set the list of variables to summarize by as everything except hits
SumOver <- setdiff(colnames(MyData),"h")

# total up all the hits - again this line throws an error
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind(SumOver)]), FUN=sum)

The business purpose I'm using this for involves a CSV file that has a single measure ($) and currently about half a dozen dimensions (product, customer, state code, dates, etc.). I'd like to be able to add dimensions to the CSV file without having to edit the script each time.

I should mention that I've been able to accomplish this using ddply, but I know that using ddply to summarize a single measure is wasteful in terms of run time; aggregate is much faster.

Thanks in advance!

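ANSWER (specific to the example in the question): the block should be

# c() is the usual way to build a vector of column names
MyData <- baseball[, c("id", "team", "h")]
# group by every column except the measure
SumOver <- setdiff(colnames(MyData), "h")
# a data frame is already a list of columns, so pass it to by= directly
AggHits <- aggregate(x=MyData$h, by=MyData[SumOver], FUN=sum)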
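
As a note on why the original call failed: wrapping a multi-column selection in list() appears to produce a list with a single element (the whole data frame), whereas aggregate expects one list element per grouping variable, each as long as x. The single-column version worked because MyData[,"id"] drops to a plain vector. A minimal sketch of the difference, using a small made-up data frame in place of baseball (the values are illustrative only, not real data):

```r
# Toy stand-in for the baseball columns; the values are made up.
MyData <- data.frame(
  id   = c("a", "a", "b", "b"),
  team = c("X", "Y", "X", "X"),
  h    = c(1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# list() around a data frame gives ONE element: the data frame itself.
wrapped <- list(MyData[, c("id", "team")])
# A data frame is already a list of its columns: TWO elements here.
direct <- MyData[c("id", "team")]

length(wrapped)  # one grouping "variable", whose length is the column count
length(direct)   # one grouping variable per column, each nrow(MyData) long

# Passing the data frame directly gives one row per id/team combination.
AggHits <- aggregate(x = MyData$h, by = direct, FUN = sum)
```

Passing wrapped as by= reproduces the "arguments must have same length" error on this toy data as well.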
Adam Hoelscher asked Feb 14 '23 10:02


2 Answers

This aggregates by every non-integer column (id, team, lg). More generally, it shows a strategy for aggregating over an arbitrary set of columns (by=MyData[cols.to.group.on]):

MyData <- plyr::baseball
cols <- names(MyData)[sapply(MyData, class) != "integer"]
aggregate(MyData$h, by=MyData[cols], sum)
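
One possible variation on this idea, sketched under the assumption that every numeric column is a measure and everything else is a dimension: pick the grouping columns with vapply and is.numeric rather than comparing class strings. This sidesteps columns that carry more than one class (for which sapply(MyData, class) would return a list instead of a character vector). The sales data frame and its column names below are hypothetical, standing in for the CSV described in the question:

```r
# Hypothetical sales data: two dimensions and one measure (made-up values).
sales <- data.frame(
  product = c("p1", "p1", "p2", "p2"),
  state   = c("CA", "CA", "CA", "NY"),
  amount  = c(10, 5, 7, 3),
  stringsAsFactors = FALSE
)

# TRUE for numeric columns (treated here as measures), FALSE otherwise.
is_measure <- vapply(sales, is.numeric, logical(1))

# Every non-numeric column becomes a grouping variable.
group_cols <- names(sales)[!is_measure]

# Sum the measure for each combination of the grouping columns.
totals <- aggregate(sales$amount, by = sales[group_cols], FUN = sum)
```

If a dimension is stored as a number (a year, say), it would be treated as a measure under this rule, so the setdiff approach from the question may be the safer choice in that case.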
BrodieG answered Feb 16 '23 02:02


Here is a solution using the formula interface of aggregate from base R. The formula h ~ . means "summarize h by every other column in the data":

data(baseball, package = "plyr")

MyData  <- baseball[,c("id","h", "team")]
AggHits <- aggregate(h ~ ., data = MyData, sum)
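
One difference from the by= form that may be worth knowing: the formula method applies na.action = na.omit by default, so rows with a missing measure are dropped before summing, while the by= form passes the NA through to sum. The sketch below illustrates this on a small made-up data frame (not the real baseball data); it also shows that the formula form keeps the measure's name in the result, where the by= form names it x:

```r
# Toy data with one missing measure value; the values are made up.
MyData <- data.frame(
  id   = c("a", "a", "b", "b"),
  team = c("X", "X", "X", "Y"),
  h    = c(1, NA, 3, 4),
  stringsAsFactors = FALSE
)

# Formula interface: the row with h = NA is dropped before aggregating.
by_formula <- aggregate(h ~ ., data = MyData, FUN = sum)

# by= interface: the NA reaches sum(), so that group's total is NA.
by_list <- aggregate(MyData$h, by = MyData[c("id", "team")], FUN = sum)

names(by_formula)  # grouping columns, then "h"
names(by_list)     # grouping columns, then "x"
```

To make the by= form ignore missing values too, na.rm = TRUE can be passed through to sum, for example aggregate(MyData$h, by = MyData[c("id", "team")], FUN = sum, na.rm = TRUE).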
Ramnath answered Feb 16 '23 04:02