
How can I use ddply with varying .variables?

Tags: r, plyr

I use ddply to summarize a data.frame by various categories, like this:

# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
                  sumGroupSize = sum(someValue))

This works smoothly, but often I like to calculate ratios which implies that I need to divide by the group's total. How can I calculate such a total within the same ddply call?

Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first. Sure, I could do this with two ddply calls, but doing it all in one call would be more convenient. Is there a way to do so?

EDIT: I did not mean to ask an overly specific question, but I realize my original wording was confusing people. So here's my specific problem. In fact I do have an example that works, but I don't consider it very elegant. It also has a shortcoming that I need to overcome: it does not work correctly with apply.

library(plyr)

# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA


# someValue is what gets summarized!
# note that we have another, varying category, hence the parameter a
calcShares <- function(a, data) {
  # !is.na needs to be specific to the category column a!
  tempres1 <- eval(substitute(ddply(data[!is.na(a), ], .(group, size, a), summarize,
                                    sumTest = sum(someValue, na.rm = TRUE))),
                   envir = data, enclos = parent.frame())
  tempres2 <- eval(substitute(ddply(data[!is.na(a), ], .(group, size), summarize,
                                    sumTestTotal = sum(someValue, na.rm = TRUE))),
                   envir = data, enclos = parent.frame())

  res <- merge(tempres1, tempres2, by = c("group", "size"))
  res$share <- res$sumTest / res$sumTestTotal
  return(res)
}

test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)   
head(test)
head(test2)

As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice. Somehow, though, it does not work correctly.

applytest <- head(apply(mydata[grep("^cat", names(mydata), value = TRUE)],
                        2, calcShares, data = mydata))

This returns a warning message and a strange name (newX[, i]) for the category variable.

So how can I a) do THIS more elegantly and b) fix the apply issue?

Asked Jan 17 '12 16:01 by Matt Bannert


1 Answer

This seems simple, so I may be missing some aspect of your question.

First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.

library(plyr)

# Create a dataset with the names in your example
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")

# A function that calculates the proportional contribution of each size class 
# to the sum of someValue within a level of group
getProps <- function(df) {
    with(df, ave(someValue, size, FUN=sum)/sum(someValue))
}

# The call to ddply()
res <- ddply(mydata, .(group), 
             .fun = function(X) transform(X, PROPS=getProps(X)))

head(res, 12)
#    someValue group size     PROPS
# 1         26     A    L 0.4785203
# 2         30     A    L 0.4785203
# 3         54     A    L 0.4785203
# 4         25     A    L 0.4785203
# 5         70     A    L 0.4785203
# 6         52     A    L 0.4785203
# 7         51     A    L 0.4785203
# 8         26     A    L 0.4785203
# 9         67     A    L 0.4785203
# 10        18     A    M 0.2577566
# 11        21     A    M 0.2577566
# 12        29     A    M 0.2577566
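
For the varying category columns in your edit, here is one possible sketch rather than a definitive fix. The apply() problem most likely comes from apply() coercing the data frame to a matrix and handing each column to the function as a plain vector, so substitute() captures the expression newX[, i] instead of a column name. Passing the column name as a string avoids this, since ddply() also accepts a character vector for .variables. The function name calcSharesByName and the list name shares are made up for illustration.

```r
library(plyr)

# Rebuild the dataset from the question
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category  <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)]  <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

# Hypothetical helper: 'a' is the NAME of the category column, as a string
calcSharesByName <- function(a, data) {
  # Drop rows where this particular category is missing
  d <- data[!is.na(data[[a]]), ]
  # Sum someValue within each group / size / category cell
  res <- ddply(d, c("group", "size", a), summarize,
               sumTest = sum(someValue, na.rm = TRUE))
  # Total per group / size, repeated on every row of that cell
  res$sumTestTotal <- ave(res$sumTest, res$group, res$size, FUN = sum)
  res$share <- res$sumTest / res$sumTestTotal
  res
}

# Loop over column names instead of using apply() on the columns themselves
catVars <- grep("^cat", names(mydata), value = TRUE)
shares <- lapply(catVars, calcSharesByName, data = mydata)
names(shares) <- catVars

head(shares$category)
head(shares$categoryA)
```

This keeps a single ddply() call per category and replaces the merge() step with ave(), so no eval(substitute(...)) is needed. Within each group / size combination the share column should sum to 1.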
Answered Sep 19 '22 21:09 by Josh O'Brien