Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data.table conditional aggregation

Tags:

r

data.table

I'm faced with (what I think) is a tough problem with aggregations on data.table I've the following data.table

structure(list(id1 = c("a", "a", "a", "b", "b", "c", "c"), id2 = c("x", 
"y", "z", "x", "u", "y", "z"), val = c(2, 1, 2, 1, 3, 4, 3)), .Names = c("id1", 
"id2", "val"), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x1f66a78>)

I would like to create conditional aggregates on the val column for this data based on the second column id2. The way the aggregation is done is to only include id1 groups which have at least one element from a given id2 element. I'll step through an example to show what I mean.

The conditional aggregate for x (the first row 2nd column) would include val values 2,1,2 for id1 = a and val values = 1,3 from id1 = b because id2=x exists for them but no values from id1=c, resulting in a value of 2 + 1 + 2 + 1 + 3 = 9. I want the 9 as a 4th column in every row where id2 = x appears.

Likewise, I want to do this for all id2 values. So the final output would be

    id1 id2 val c.sum
1:   a   x   2     9
2:   a   y   1    12
3:   a   z   2    12
4:   b   x   1     9
5:   b   u   3     4
6:   c   y   4    12
7:   c   z   3    14

Is this possible in R, data.table? Or any other package/method? Thanks in advance

like image 589
broccoli Avatar asked Jan 15 '15 01:01

broccoli


2 Answers

Given that d is your input structure:

library(data.table)

d[,c.sum:=sum(d$val[d$id1 %in% id1]),by=id2][]

How it works: by=id2 groups input data table d by id2; d$id1 %in% id1 selects rows in d whose id1 matches id1 of the group under consideration; sum(d$val[...]) takes sum of values from such rows; finally, c.sum:=sum(...) adds a column c.sum to d. The ending [] are needed only for the printing purpose.

The output is:

#    id1 id2 val c.sum
# 1:   a   x   2     9
# 2:   a   y   1    12
# 3:   a   z   2    12
# 4:   b   x   1     9
# 5:   b   u   3     4
# 6:   c   y   4    12
# 7:   c   z   3    12
like image 113
Marat Talipov Avatar answered Sep 30 '22 06:09

Marat Talipov


This is a bit brute force, but it should work (assuming data is your data structure):

id1_sums <- tapply(data$val,data$id1,sum)  
for(id in unique(data$id2))
    data$c.sum[data$id2  == id] <- sum(
            id1_sums[which(names(id1_sums) %in% data$id1[data$id2 == id])])
like image 25
Jthorpe Avatar answered Sep 30 '22 06:09

Jthorpe