I am trying to find the "group" (id3) based on two variables (id1, id2):
df = data.frame(id1 = c(1,1,2,2,3,3,4,4,5,5),
id2 = c('a','b','a','c','c','d','x','y','y','z'),
id3 = c(rep('group1',6), rep('group2',4)))
   id1 id2    id3
1    1   a group1
2    1   b group1
3    2   a group1
4    2   c group1
5    3   c group1
6    3   d group1
7    4   x group2
8    4   y group2
9    5   y group2
10   5   z group2
For example, id1=1 is related to a and b of id2. But id1=2 is also related to a, so both belong to one group (id3=group1). Since id1=2 and id1=3 share id2=c, id1=3 also belongs to that group (id3=group1). The values of the tuple ((1,2,3),('a','b','c','d')) appear nowhere else, so no other row belongs to that group (which is labeled group1 generically).
If you need to take care of NAs, check this similar post.
My idea was to create a table based on id3, which would subsequently be populated in a loop.
solution = data.frame(id3 = c('group1', 'group2'), id1 = NA, id2 = NA)
group = 1
for (step in c(1:1000)) { # run many steps to make sure to get all values
  solution$id1[group] = # populate
  solution$id2[group] = # populate
  if (fully populated) {
    group = group + 1
  }
}
I am struggling to see how to populate.
Disclaimer: I asked a similar question here, but using names in id2 led a lot of people to point me to fuzzy string procedures in R, which are not needed here, since there exists an exact solution. I also include all code I have tried since then in this post.
You can leverage igraph to find the connected components (clusters) of the network:
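library(igraph)
# Treat each (id1, id2) pair as an edge of an undirected graph
g <- graph_from_data_frame(df, directed = FALSE)
# Component membership, named by vertex (components() supersedes clusters())
cg <- components(g)$membership
# Look up each row's component by vertex name rather than by position
df$id3 <- cg[as.character(df$id1)]
df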
output:
   id1 id2 id3
1    1   a   1
2    1   b   1
3    2   a   1
4    2   c   1
5    3   c   1
6    3   d   1
7    4   x   2
8    4   y   2
9    5   y   2
10   5   z   2
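If you would rather avoid the extra dependency, the loop idea from the question can be sketched in base R. This is only a rough sketch, not a replacement for the igraph approach: it assumes there are no NAs in id1 or id2, and the names label and new_label are made up for illustration. Every row starts with its own label, and each pass gives a row the smallest label found among the rows sharing its id1 or its id2, until nothing changes anymore.

```r
# Example data from the question (without the hand-made id3 column)
df <- data.frame(id1 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                 id2 = c("a", "b", "a", "c", "c", "d", "x", "y", "y", "z"))

# Start with one label per row (hypothetical helper vector)
label <- seq_len(nrow(df))

repeat {
  # Rows sharing an id1 take the smallest label among them
  new_label <- ave(label, df$id1, FUN = min)
  # Rows sharing an id2 take the smallest label among them
  new_label <- ave(new_label, df$id2, FUN = min)
  # Stop once a full pass changes nothing
  if (identical(new_label, label)) break
  label <- new_label
}

# Renumber the surviving labels as 1, 2, ... in order of first appearance
df$id3 <- match(label, unique(label))
df
```

Labels can only decrease from pass to pass, so the loop stops by itself and there is no need to guess a step count like 1000. On large data the igraph version is likely the better choice, since this sketch may need many passes for long chains of linked ids.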