How do I preserve combinations of variables that do not appear in the input data when grouping with data.table?

Question

Using data.table package, is it possible to summarise data preserving combinations of variables that do not appear in the input?

With plyr package I know how to do this with the .drop argument, for example:

require(plyr)
df <- data.frame(categories = c(rep("A",3), rep("B",3), rep("C",3)), groups = c(rep(c("X", "Y"),4), "Z"), values = rep(1, 9))

df1 <- ddply(df, c("categories","groups"), .drop = F, summarise, sum = sum(values))

output:

 categories groups sum
1          A      X   2
2          A      Y   1
3          A      Z   0
4          B      X   1
5          B      Y   2
6          B      Z   0
7          C      X   1
8          C      Y   1
9          C      Z   1

In this case I preserve all groups/categories combinations even if its sum is 0.

Matt Dowle · Accepted Answer

Great question. Here are two ways. They both use by-without-by.

DT = as.data.table(df)
setkey(DT,categories,groups)
DT[CJ(unique(categories),unique(groups)), sum(values,na.rm=TRUE)]

   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z  0
4:          B      X  1
5:          B      Y  2
6:          B      Z  0
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

where CJ stands for Cross Join, see ?CJ. by-without-by just means that j gets executed on each group that each row of i joins to.

Admitedly it looks tricky on first sight. The idea is that if you have a known subset of groups, this syntax is faster than grouping everything and then selecting just the results from that you need. But in this case you'd like everything anyway so there's not much advantage, other than being able to lookup groups that don't exist in the data (which you can't do with by).

Another way is to by first as normal, then join the CJ() result to that :

DT[,sum(values),keyby='categories,groups'][CJ(unique(categories),unique(groups))]
   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z NA
4:          B      X  1
5:          B      Y  2
6:          B      Z NA
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

but then you get NA instead of the desired 0. Those could be replaced using set() if need be. The second way might be faster because the two unique calls are given much smaller input.

Both methods can be wrapped up into small helper functions if you do this a lot.

How do I preserve combinations of variables that do not appear in the input data when grouping with data.table?

Tags:

r

data.table

Davi Moreira

1 Answers

Matt Dowle

Recent Activity

Donate For Us

How do I preserve combinations of variables that do not appear in the input data when grouping with data.table?

Tags:

r

data.table

Davi Moreira

1 Answers

Matt Dowle

Related questions

Recent Activity

Donate For Us