Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

faster way to create variable that aggregates a column by id [duplicate]

Is there a faster way to do this? I guess this is unnecessary slow and that a task like this can be accomplished with base functions.

df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))

I'm quite new to R. I have looked at by(), aggregate() and tapply(), but didn't get them to work at all or in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original dataframe. What is the best way to do this?

Edit: Here is a speed comparison of the answers applied to my data.

> # My original solution
> system.time( ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc))) )
   user  system elapsed 
 14.405   0.000  14.479 

> # Paul Hiemstra
> system.time( ddply(df, "id", transform, perc.total = sum(cand.perc)) )
   user  system elapsed 
 15.973   0.000  15.992 

> # Richie Cotton
> system.time( with(df, tapply(df$cand.perc, df$id, sum))[df$id] )
   user  system elapsed 
  0.048   0.000   0.048 

> # John
> system.time( with(df, ave(cand.perc, id, FUN = sum)) )
       user  system elapsed 
      0.032   0.000   0.030 

> # Christoph_J
> system.time( df[ , list(perc.total = sum(cand.perc)), by="id"][df])
   user  system elapsed 
  0.028   0.000   0.028 
like image 764
ilprincipe Avatar asked Nov 22 '11 10:11

ilprincipe


3 Answers

Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:

library(data.table)
DT <- data.table(ID = rep(c(1:3), each=3),
                 cand.perc = 1:9,
                 key="ID")
DT <- DT[ , perc.total := sum(cand.perc), by = ID]
DT
      ID Perc.total cand.perc
 [1,]  1          6         1
 [2,]  1          6         2
 [3,]  1          6         3
 [4,]  2         15         4
 [5,]  2         15         5
 [6,]  2         15         6
 [7,]  3         24         7
 [8,]  3         24         8
 [9,]  3         24         9

Disclaimer: I'm not a data.table expert (yet ;-), so there might faster ways to do that. Check out the package site to get you started if you are interested in using the package: http://datatable.r-forge.r-project.org/

like image 106
Christoph_J Avatar answered Nov 15 '22 17:11

Christoph_J


For any kind of aggregation where you want a resulting vector the same length as the input vector with replicates grouped across the grouping vector ave is what you want.

df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
like image 6
John Avatar answered Nov 15 '22 16:11

John


Use tapply to get the group stats, then add them back into your dataset afterwards.

Reproducible example:

means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]

Untested solution for your scenario:

sum_by_id <- with(df, tapply(cand.perc, id, sum))
df$perc.total <- sum_by_id[df$id]
like image 3
Richie Cotton Avatar answered Nov 15 '22 17:11

Richie Cotton