
difference between ddply and aggregate

Tags: r

Can someone help me understand the difference between aggregate and ddply, using the following example?

A data frame (the values are random draws, so they differ from run to run):

mydat <- data.frame(first = rpois(10,10), second = rpois(10,10), 
                    third = rpois(10,10), group = c(rep("a",5),rep("b",5)))

Use aggregate to apply a function to a part of the data frame split by a factor:

aggregate(mydat[,1:3], by=list(mydat$group), mean)
  Group.1 first second third
1       a   8.8    8.8  10.2
2       b   6.8    9.4  13.4

Try to use aggregate with a function that needs two columns at once (returns an error message):

aggregate(mydat[,1:3], by=list(mydat$group), function(u) cor(u$first,u$second))
Error in u$second : $ operator is invalid for atomic vectors

Now, try the same with ddply (plyr package):

library(plyr)
ddply(mydat, .(group), function(u) cor(u$first,u$second))
  group         V1
1     a -0.5083042
2     b -0.6329968

All tips, links, criticism are highly appreciated.

asked Jan 05 '13 21:01 by skip


2 Answers

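aggregate calls FUN on each column of each group separately, so FUN only ever sees an atomic vector. That is why you get a separate mean for every column, and why u$second fails in your second call. ddply instead passes the whole sub-data-frame for each group to the function, so all columns are available by name. A quick demonstration of what aggregate passes may be in order: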

Some sample data for demonstration:

d <- data.frame(a=1:4, b=5:8, c=c(1,1,2,2))

> d
  a b c
1 1 5 1
2 2 6 1
3 3 7 2
4 4 8 2

Using print as the function, and ignoring the value returned by aggregate or ddply, shows what gets passed to the function on each call.

aggregate:

tmp <- aggregate(d[1:2], by=list(d$c), print)
[1] 1 2
[1] 3 4
[1] 5 6
[1] 7 8

Note that individual columns are sent to print.
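If you would rather check this programmatically than read printed output, here is a minimal sketch using the same d as above. It asks FUN to report the class of its argument; the column name c in the by list is my own choice for readability.

```r
d <- data.frame(a = 1:4, b = 5:8, c = c(1, 1, 2, 2))

# FUN reports the class of whatever aggregate hands to it.
# Each call receives one column of one group, i.e. a plain vector.
classes <- aggregate(d[1:2], by = list(c = d$c), FUN = function(x) class(x))

# FUN reports whether its argument is a data frame (it never is here).
is_df <- aggregate(d[1:2], by = list(c = d$c), FUN = function(x) is.data.frame(x))

print(classes)
print(is_df)
```

Since FUN never sees a data frame, anything like u$first inside it cannot work.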

ddply:

tmp <- ddply(d, .(c), print)
  a b c
1 1 5 1
2 2 6 1
  a b c
3 3 7 2
4 4 8 2

Note that data frames are being sent to print.
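The same kind of check for ddply, as a rough sketch that assumes the plyr package is installed. The result column names is_df, n_col and r are names I made up for illustration. Because each group arrives as a data frame, a two-column function such as cor can refer to columns by name, either in an anonymous function or through plyr's summarise.

```r
library(plyr)  # provides ddply(), .() and summarise()

d <- data.frame(a = 1:4, b = 5:8, c = c(1, 1, 2, 2))

# Each call receives the full sub-data-frame for one value of c,
# so the function can inspect it and use several columns together.
info <- ddply(d, .(c), function(u) {
  data.frame(is_df = is.data.frame(u),
             n_col = ncol(u),
             r     = cor(u$a, u$b))
})

# The same correlation written with summarise, which evaluates
# the expression inside each group's data frame.
info2 <- ddply(d, .(c), summarise, r = cor(a, b))

print(info)
print(info2)
```

Returning a one-row data frame from the function also gives the result column a proper name instead of the default V1.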

answered Sep 30 '22 02:09 by Matthew Lundberg


You've already been told why aggregate was the wrong {base} function to use for a function that requires two vectors as arguments, but you haven't yet been told which non-ddply approach would have succeeded.

The by( ... grp, FUN) method:

> cbind (by( mydat, mydat["group"], function(d) cor(d$first, d$second)) )
        [,1]
a  0.6529822
b -0.1964186

The sapply(split( ..., grp), fn) method:

> sapply(  split( mydat, mydat["group"]), function(d) cor(d$first, d$second)) 
         a          b 
 0.6529822 -0.1964186 
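To convince yourself that the two approaches agree, here is a small self-contained sketch. The set.seed(1) call is my addition so the run is repeatable; it will not reproduce the numbers shown above. The last part is a workaround, not something aggregate is designed for: if you really want to stay with aggregate, you can aggregate over row indices and look the columns up inside FUN.

```r
set.seed(1)  # assumption: fixed seed, not part of the original question
mydat <- data.frame(first = rpois(10, 10), second = rpois(10, 10),
                    third = rpois(10, 10), group = c(rep("a", 5), rep("b", 5)))

# by(): FUN receives the sub-data-frame for each group
by_res <- by(mydat, mydat["group"], function(d) cor(d$first, d$second))

# split() + sapply(): same idea, result is a named numeric vector
sp_res <- sapply(split(mydat, mydat["group"]), function(d) cor(d$first, d$second))

# aggregate() over row indices: FUN gets the row numbers of one group
# and uses them to subset both columns of mydat
ag_res <- aggregate(seq_len(nrow(mydat)), by = list(group = mydat$group),
                    FUN = function(i) cor(mydat$first[i], mydat$second[i]))

print(all.equal(as.numeric(by_res), unname(sp_res)))
print(all.equal(ag_res$x, unname(sp_res)))
```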

answered Sep 30 '22 02:09 by IRTFM