R data.table conditional aggregation

Question

I'm facing what I think is a tough aggregation problem with data.table. I have the following data.table:

library(data.table)

# Same data as the original dput() output, without the non-parseable pointer
d <- data.table(id1 = c("a", "a", "a", "b", "b", "c", "c"),
                id2 = c("x", "y", "z", "x", "u", "y", "z"),
                val = c(2, 1, 2, 1, 3, 4, 3))

I would like to compute a conditional aggregate of the val column for each value of the second column, id2. For a given id2 value, the aggregate sums val over every id1 group that contains at least one row with that id2 value. I'll step through an example to show what I mean.

The conditional aggregate for x (the id2 value in the first row) includes the val values 2, 1, 2 from id1 = a and the val values 1, 3 from id1 = b, because both groups contain a row with id2 = x. It includes no values from id1 = c, which has no such row. The result is 2 + 1 + 2 + 1 + 3 = 9, and I want that 9 as a fourth column in every row where id2 = x.
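As a quick check of that arithmetic, here is a minimal base R sketch. The plain data.frame named d is an assumption for illustration (it simply re-enters the sample data above). The logic is: find the id1 groups that have a row with id2 = x, then sum val over all rows of those groups.

```r
# Sample data re-entered as a plain data.frame (the name d is illustrative)
d <- data.frame(
  id1 = c("a", "a", "a", "b", "b", "c", "c"),
  id2 = c("x", "y", "z", "x", "u", "y", "z"),
  val = c(2, 1, 2, 1, 3, 4, 3),
  stringsAsFactors = FALSE
)

# id1 groups that contain at least one row with id2 == "x"
groups_with_x <- unique(d$id1[d$id2 == "x"])

# Sum val over every row belonging to one of those groups
x_sum <- sum(d$val[d$id1 %in% groups_with_x])
print(x_sum)  # 9
```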

Likewise, I want to do this for all id2 values. So the final output would be

    id1 id2 val c.sum
1:   a   x   2     9
2:   a   y   1    12
3:   a   z   2    12
4:   b   x   1     9
5:   b   u   3     4
6:   c   y   4    12
7:   c   z   3    12

Is this possible in R with data.table, or with any other package or method? Thanks in advance.

Marat Talipov · Accepted Answer

Given that d is your input structure:

library(data.table)

d[,c.sum:=sum(d$val[d$id1 %in% id1]),by=id2][]

How it works: by=id2 groups the input data table d by id2. Within each group, id1 refers to the id1 values of that group's rows, so d$id1 %in% id1 flags every row of the full table d whose id1 occurs in the group under consideration. sum(d$val[...]) adds up val over those rows, and c.sum:=sum(...) stores the result in a new column c.sum of d, repeated for every row of the group. The trailing [] is only needed to print the result.

The output is:

#    id1 id2 val c.sum
# 1:   a   x   2     9
# 2:   a   y   1    12
# 3:   a   z   2    12
# 4:   b   x   1     9
# 5:   b   u   3     4
# 6:   c   y   4    12
# 7:   c   z   3    12
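To make the grouped expression more concrete, the computation for a single group can be unrolled by hand. This is only a sketch: it rebuilds the question's sample data with the data.table constructor, and the names grp_id1, in_grp and y_sum are illustrative, not part of the one-liner above.

```r
library(data.table)

# Sample data from the question, rebuilt with the data.table constructor
d <- data.table(
  id1 = c("a", "a", "a", "b", "b", "c", "c"),
  id2 = c("x", "y", "z", "x", "u", "y", "z"),
  val = c(2, 1, 2, 1, 3, 4, 3)
)

# What `id1` evaluates to inside the group id2 == "y"
grp_id1 <- d[id2 == "y", id1]

# Rows of the full table whose id1 occurs in that group
in_grp <- d$id1 %in% grp_id1

# The value that c.sum receives for every row with id2 == "y"
y_sum <- sum(d$val[in_grp])
print(y_sum)  # 12
```

Note that := adds the column to d by reference, so if the original table has to stay unchanged it is safer to run the one-liner on copy(d).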

Jthorpe · Answer

This is a bit brute force, but it should work (assuming data is your data structure):

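# Total of val within each id1 group, named by id1 (a = 5, b = 4, c = 7)
id1_sums <- tapply(data$val, data$id1, sum)
data$c.sum <- NA_real_  # create the column before filling it group by group
for (id in unique(data$id2)) {
    rows <- data$id2 == id
    # Add up the totals of every id1 group that has a row with this id2
    data$c.sum[rows] <- sum(id1_sums[names(id1_sums) %in% data$id1[rows]])
}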
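If the explicit loop is a concern, the same idea (precompute the total per id1, then combine totals per id2) can be written without it. The following is a sketch in base R under the assumption that the data is a plain data.frame called d; the helper name per_id2 is made up for this example.

```r
# Sample data from the question as a plain data.frame (the name d is illustrative)
d <- data.frame(
  id1 = c("a", "a", "a", "b", "b", "c", "c"),
  id2 = c("x", "y", "z", "x", "u", "y", "z"),
  val = c(2, 1, 2, 1, 3, 4, 3),
  stringsAsFactors = FALSE
)

# Total of val per id1 group, named by id1
id1_sums <- tapply(d$val, d$id1, sum)

# For each id2 value, add the totals of the distinct id1 groups it appears in
per_id2 <- vapply(
  split(d$id1, d$id2),
  function(g) sum(id1_sums[unique(g)]),
  numeric(1)
)

# Look up each row's aggregate by its id2 value
d$c.sum <- unname(per_id2[d$id2])
print(d)
```

The unique() call keeps a group total from being counted twice if the same id1/id2 pair ever appears in more than one row.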

Tags:

r

data.table

broccoli
