I'm trying to write an R script that summarizes measures in a data frame, and I'd like it to adapt automatically when the structure of the data frame changes. For example, I have the following block.
library(plyr) #loading plyr just to access baseball data frame
MyData <- baseball[,cbind("id","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,"id"]), FUN=sum)
This block creates a data frame (AggHits) with the total hits (h) for each player (id). Yay.
Suppose I want to bring in the team. How do I change the by argument so that AggHits has the total hits for each combination of "id" and "team"? I tried the following, but the second line throws an error: "arguments must have same length".
MyData <- baseball[,cbind("id","team","h")]
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind("id","team")]), FUN=sum)
More generally, I'd like to write the second line so that it automatically aggregates h by all variables except h. I can generate the list of variables to group by pretty easily using setdiff.
# set the list of variables to summarize by as everything except hits
SumOver <- setdiff(colnames(MyData),"h")
# total up all the hits - again this line throws an error
AggHits <- aggregate(x=MyData$h, by=list(MyData[,cbind(SumOver)]), FUN=sum)
The business purpose I'm using this for involves a csv file which has a single measure ($) and currently has about a half dozen dimensions (product, customer, state code, dates, etc.). I'd like to be able to add dimensions to the csv file without having to edit the script each time.
I should mention that I've been able to accomplish this using ddply, but I know that using ddply to summarize a single measure is wasteful with regard to run time; aggregate is much faster.
Thanks in advance!
ANSWER (specific to the example in the question): the block should be
MyData <- baseball[,cbind("id","team","h")]
SumOver <- setdiff(colnames(MyData),"h")
AggHits <- aggregate(x=MyData$h, by=MyData[SumOver], FUN=sum)
The code below aggregates by every non-integer column (ID, Team, League), but more generally it shows a strategy for aggregating over an arbitrary list of columns (by=MyData[cols.to.group.on]):
MyData <- plyr::baseball
cols <- names(MyData)[sapply(MyData, class) != "integer"]
aggregate(MyData$h, by=MyData[cols], sum)
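To see why by=list(MyData[, ...]) fails while by=MyData[cols] works, here is a minimal sketch on a toy data frame. The data frame toy, its values, and the name group_cols are made up for illustration; only aggregate, setdiff, and lengths are real base R functions.

```r
# Toy data standing in for the baseball columns (values are invented).
toy <- data.frame(
  id   = c("a", "a", "b", "b"),
  team = c("X", "X", "X", "Y"),
  h    = c(1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)
group_cols <- setdiff(names(toy), "h")

# A data frame is already a list of columns, so toy[group_cols] has one
# element per grouping variable, each as long as toy$h.
len_ok <- lengths(toy[group_cols])

# Wrapping it in list() gives a single element whose length is the number
# of columns, which no longer matches the number of values in toy$h.
len_bad <- lengths(list(toy[, group_cols]))

# The call from the question, with the error captured instead of thrown.
failed <- tryCatch(
  aggregate(x = toy$h, by = list(toy[, group_cols]), FUN = sum),
  error = function(e) conditionMessage(e)
)

# Passing the sub-data-frame directly works.
agg <- aggregate(x = toy$h, by = toy[group_cols], FUN = sum)
print(agg)
```

Because x is passed as a bare vector, the summed column in the result is named x; rename it afterwards if the script depends on the original name.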
Here is a solution using aggregate from base R:
data(baseball, package = "plyr")
MyData <- baseball[,c("id","h", "team")]
AggHits <- aggregate(h ~ ., data = MyData, sum)
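In the formula h ~ . the dot stands for every other column in data, so the grouping follows whatever columns the data frame has. As a rough sketch of how that could cover the CSV case from the question, the helper sum_measure and the sales data below are hypothetical names and values, not part of the original post.

```r
# Hypothetical helper: sum one measure column over all remaining columns.
sum_measure <- function(df, measure) {
  # Build e.g. amount ~ . from the measure's column name.
  f <- as.formula(paste(measure, "~ ."))
  aggregate(f, data = df, FUN = sum)
}

# Invented stand-in for the CSV contents.
sales <- data.frame(
  product = c("p1", "p1", "p2", "p2"),
  state   = c("NY", "NY", "NY", "CA"),
  amount  = c(10, 5, 7, 3)
)
by_product_state <- sum_measure(sales, "amount")

# Adding a dimension requires no change to the helper.
sales$year <- c(2020L, 2021L, 2021L, 2021L)
by_all <- sum_measure(sales, "amount")
print(by_all)
```

One caveat: the formula method of aggregate defaults to na.action = na.omit, so rows with an NA in any column are dropped before summing. If the dimensions can contain missing values, the by= form may be the safer choice.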