In the data.table
below, I have information on the composition of teams participating to projects. The variable id
tells the team id while the variable event
gives the project number. The variable freqrel
describes the composition of the teams (you can see that freqrel adds up to 1 within every team).
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), event = c("127b", "127b", "127b", "127b",
"127b", "127b", "127b", "127b", "127b", "125t", "125t", "125t",
"125t", "125t", "125t"), membr = c("engineer", "mathematician",
"physicist", "mathematician", "physicist", "surgeon", "dentist",
"mathematician", "programmer", "physicist", "sociologist", "surgeon",
"musician", "sociologist", "surgeon"), freqrel = c(0.4, 0.4,
0.2, 0.166666666666667, 0.5, 0.333333333333333, 0.333333333333333,
0.5, 0.166666666666667, 0.75, 0.125, 0.125, 0.444444444444444,
0.444444444444444, 0.111111111111111)), .Names = c("id", "event",
"membr", "freqrel"), row.names = c(NA, -15L), class = c("data.table",
"data.frame"), sorted = c("id", "event"), .internal.selfref = <pointer: 0x039a24a0>)
The way I see the data are split into nested groups. The first division occurs at the project level (straight line) and the second at the team level (dashed line).
id event membr freqrel
1: 1 127b engineer 0.4000000
2: 1 127b mathematician 0.4000000
3: 1 127b physicist 0.2000000
--------------------------------------
4: 2 127b mathematician 0.1666667
5: 2 127b physicist 0.5000000
6: 2 127b surgeon 0.3333333
--------------------------------------
7: 3 127b dentist 0.3333333
8: 3 127b mathematician 0.5000000
9: 3 127b programmer 0.1666667
_____________________________________
10: 4 125t physicist 0.7500000
11: 4 125t sociologist 0.1250000
12: 4 125t surgeon 0.1250000
--------------------------------------
13: 5 125t musician 0.4444444
14: 5 125t sociologist 0.4444444
15: 5 125t surgeon 0.1111111
From this starting condition I would like to make teams within the same project perfectly comparable by adding to each of them also the membr
types that the team doesn't feature, assigning them freqrel=0. The result should be this:
id event membr freqrel
1: 1 127b dentist 0.0000000
2: 1 127b engineer 0.4000000
3: 1 127b mathematician 0.4000000
4: 1 127b physicist 0.2000000
5: 1 127b programmer 0.0000000
6: 1 127b surgeon 0.0000000
--------------------------------------
7: 2 127b dentist 0.0000000
8: 2 127b engineer 0.0000000
9: 2 127b mathematician 0.1666667
10: 2 127b physicist 0.5000000
11: 2 127b programmer 0.0000000
12: 2 127b surgeon 0.3333333
--------------------------------------
13: 3 127b dentist 0.3333333
14: 3 127b engineer 0.0000000
15: 3 127b mathematician 0.5000000
16: 3 127b physicist 0.0000000
17: 3 127b programmer 0.1666667
18: 3 127b surgeon 0.0000000
_____________________________________
19: 4 125t musician 0.0000000
20: 4 125t physicist 0.7500000
21: 4 125t sociologist 0.1250000
22: 4 125t surgeon 0.1250000
--------------------------------------
23: 5 125t musician 0.4444444
24: 5 125t physicist 0.0000000
25: 5 125t sociologist 0.4444444
26: 5 125t surgeon 0.1111111
In other words, after dividing the data with by
using event
as a key, I need to divide a second time and compare the chunks of data obtained with the second splitting.
But here the problem is that I don't know how to reference the first chunk obtained with by
and then how to split again and do the comparisons among the pieces of the database. Do you have an idea how I could sort this out?
I'd be extremely thankful if you could help me. Really.
Here's an easy way:
setkey(dt, id, membr)
ans <- dt[, .SD[CJ(unique(id), unique(membr))], by=list(event)]
Then, you can just replace the NA
with 0's as follows:
ans[is.na(freqrel), freqrel := 0.0]
Some explanation: Your problem boils down to this - for every event
, you want all possible combinations of id, membr
so that you can then perform a join on this all-combination within that grouping using .SD
.
So, first we group by event
, and within that, we first get all combinations of id, membr
with the help of CJ
(which will have a key set to all columns by default). However, to perform a join we need to have the key set for .SD
. Therefore, we set the key
for dt
to id, membr
upfront. Thus, we perform a join within each group and that gives you the intended result. Hope this helps a bit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With