Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I preserve combinations of variables that do not appear in the input data when grouping with data.table?

Tags:

r

data.table

Using data.table package, is it possible to summarise data preserving combinations of variables that do not appear in the input?

With plyr package I know how to do this with the .drop argument, for example:

require(plyr)
df <- data.frame(categories = c(rep("A",3), rep("B",3), rep("C",3)), groups = c(rep(c("X", "Y"),4), "Z"), values = rep(1, 9))

df1 <- ddply(df, c("categories","groups"), .drop = F, summarise, sum = sum(values))

output:

 categories groups sum
1          A      X   2
2          A      Y   1
3          A      Z   0
4          B      X   1
5          B      Y   2
6          B      Z   0
7          C      X   1
8          C      Y   1
9          C      Z   1

In this case I preserve all groups/categories combinations even if its sum is 0.

like image 668
Davi Moreira Avatar asked Jan 23 '13 17:01

Davi Moreira


1 Answers

Great question. Here are two ways. They both use by-without-by.

DT = as.data.table(df)
setkey(DT,categories,groups)
DT[CJ(unique(categories),unique(groups)), sum(values,na.rm=TRUE)]

   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z  0
4:          B      X  1
5:          B      Y  2
6:          B      Z  0
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

where CJ stands for Cross Join, see ?CJ. by-without-by just means that j gets executed on each group that each row of i joins to.

Admitedly it looks tricky on first sight. The idea is that if you have a known subset of groups, this syntax is faster than grouping everything and then selecting just the results from that you need. But in this case you'd like everything anyway so there's not much advantage, other than being able to lookup groups that don't exist in the data (which you can't do with by).

Another way is to by first as normal, then join the CJ() result to that :

DT[,sum(values),keyby='categories,groups'][CJ(unique(categories),unique(groups))]
   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z NA
4:          B      X  1
5:          B      Y  2
6:          B      Z NA
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

but then you get NA instead of the desired 0. Those could be replaced using set() if need be. The second way might be faster because the two unique calls are given much smaller input.

Both methods can be wrapped up into small helper functions if you do this a lot.

like image 110
Matt Dowle Avatar answered Nov 04 '22 23:11

Matt Dowle