
Nested groupings with data.table

Tags: r, data.table

The data.table below describes the composition of teams participating in projects. The variable id is the team id, and the variable event is the project number. The variable freqrel describes the composition of each team (freqrel adds up to 1 within every team).

library(data.table)

dt <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L),
  event = c("127b", "127b", "127b", "127b", "127b", "127b", "127b", "127b",
            "127b", "125t", "125t", "125t", "125t", "125t", "125t"),
  membr = c("engineer", "mathematician", "physicist", "mathematician",
            "physicist", "surgeon", "dentist", "mathematician", "programmer",
            "physicist", "sociologist", "surgeon", "musician", "sociologist",
            "surgeon"),
  freqrel = c(0.4, 0.4, 0.2, 0.166666666666667, 0.5, 0.333333333333333,
              0.333333333333333, 0.5, 0.166666666666667, 0.75, 0.125, 0.125,
              0.444444444444444, 0.444444444444444, 0.111111111111111)
)
# The original dput() output recorded this key (sorted = c("id", "event"))
setkey(dt, id, event)

The way I see it, the data are split into nested groups. The first division occurs at the project level (solid line) and the second at the team level (dashed line).

    id event         membr   freqrel
 1:  1  127b      engineer 0.4000000
 2:  1  127b mathematician 0.4000000
 3:  1  127b     physicist 0.2000000
--------------------------------------
 4:  2  127b mathematician 0.1666667
 5:  2  127b     physicist 0.5000000
 6:  2  127b       surgeon 0.3333333
--------------------------------------
 7:  3  127b       dentist 0.3333333
 8:  3  127b mathematician 0.5000000
 9:  3  127b    programmer 0.1666667
_____________________________________
10:  4  125t     physicist 0.7500000
11:  4  125t   sociologist 0.1250000
12:  4  125t       surgeon 0.1250000
--------------------------------------
13:  5  125t      musician 0.4444444
14:  5  125t   sociologist 0.4444444
15:  5  125t       surgeon 0.1111111

Starting from this, I would like to make teams within the same project directly comparable: each team should also get a row for every membr type that occurs in its project but not in the team itself, with freqrel = 0. The result should be this:

    id event         membr   freqrel
 1:  1  127b       dentist 0.0000000
 2:  1  127b      engineer 0.4000000
 3:  1  127b mathematician 0.4000000
 4:  1  127b     physicist 0.2000000
 5:  1  127b    programmer 0.0000000
 6:  1  127b       surgeon 0.0000000
--------------------------------------
 7:  2  127b       dentist 0.0000000
 8:  2  127b      engineer 0.0000000
 9:  2  127b mathematician 0.1666667
10:  2  127b     physicist 0.5000000
11:  2  127b    programmer 0.0000000
12:  2  127b       surgeon 0.3333333
--------------------------------------
13:  3  127b       dentist 0.3333333
14:  3  127b      engineer 0.0000000
15:  3  127b mathematician 0.5000000
16:  3  127b     physicist 0.0000000
17:  3  127b    programmer 0.1666667
18:  3  127b       surgeon 0.0000000
_____________________________________
19:  4  125t      musician 0.0000000
20:  4  125t     physicist 0.7500000
21:  4  125t   sociologist 0.1250000
22:  4  125t       surgeon 0.1250000
--------------------------------------
23:  5  125t      musician 0.4444444
24:  5  125t     physicist 0.0000000
25:  5  125t   sociologist 0.4444444
26:  5  125t       surgeon 0.1111111

In other words, after splitting the data with by = event, I need to split each chunk a second time by team and compare the resulting pieces. The problem is that I don't know how to reference the chunk produced by the first by, nor how to split it again and compare the teams inside it. Do you have any idea how I could sort this out?

I'd be extremely thankful if you could help me. Really.

Riccardo asked Mar 21 '23 14:03


1 Answer

Here's an easy way:

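# Key dt by (id, membr) so the join inside each group can match on those columns
setkey(dt, id, membr)
# Within each event, join .SD to every (id, membr) combination; absent pairs get NA
ans <- dt[, .SD[CJ(unique(id), unique(membr))], by=list(event)]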

Then you can replace the NAs with 0 by reference:

ans[is.na(freqrel), freqrel := 0.0]

Some explanation: your problem boils down to this. For every event, you want all possible combinations of id and membr, and then a join of that full set of combinations against the rows of the group, which are available as .SD.
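To make "all possible combinations" concrete, here is a small sketch of what CJ produces for a single project. The vectors ids and membrs are illustrative names, filled by hand with the values that appear under event "125t" in the question's data.

```r
library(data.table)

# Team ids and member types observed in project "125t" (copied from the question)
ids    <- c(4L, 5L)
membrs <- c("physicist", "sociologist", "surgeon", "musician")

# CJ() builds every (id, membr) pair; by default the result is sorted
# and keyed by all of its columns
grid <- CJ(id = ids, membr = membrs)
print(grid)
```

Two teams times four member types gives eight rows, which matches the eight rows shown for "125t" in the desired output.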

So we first group by event. Within each group, CJ builds all combinations of id and membr (its result is keyed by all of its columns by default). To join against those combinations, .SD needs a key as well, which is why the key of dt is set to id, membr up front. The join then runs once per group and gives the intended result. Hope this helps a bit.
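As a hedged variant of the same idea, the sketch below names the join columns explicitly with on= instead of relying on a key, in case .SD does not carry dt's key in the data.table version at hand (that is an assumption on my part, not something the answer states). The object name grid is illustrative, and the fractions are written as 1/6, 4/9 and so on instead of the rounded decimals in the question.

```r
library(data.table)

# Data from the question, rebuilt compactly
dt <- data.table(
  id      = rep(1:5, each = 3),
  event   = rep(c("127b", "125t"), times = c(9, 6)),
  membr   = c("engineer", "mathematician", "physicist",
              "mathematician", "physicist", "surgeon",
              "dentist", "mathematician", "programmer",
              "physicist", "sociologist", "surgeon",
              "musician", "sociologist", "surgeon"),
  freqrel = c(0.4, 0.4, 0.2, 1/6, 1/2, 1/3, 1/3, 1/2, 1/6,
              3/4, 1/8, 1/8, 4/9, 4/9, 1/9)
)

# Step 1: per event, every (id, membr) pair that should exist
grid <- dt[, CJ(id = unique(id), membr = unique(membr)), by = event]

# Step 2: look up each pair in dt; pairs missing from dt get freqrel = NA
ans <- dt[grid, on = .(event, id, membr)]

# Step 3: a member type the team lacks has relative frequency 0
ans[is.na(freqrel), freqrel := 0]

print(ans)
```

Building the grid in one grouped call and joining once keeps the two levels of grouping visible: by = event decides which member types count as candidates, and the join fills them in per team.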

Arun answered Apr 05 '23 02:04