
Using two grouping designations to create one 'combined' grouping variable

Tags: algorithm, r

Given a data.frame:

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

#> df
#   grp1 grp2
#1     1    1
#2     1    2
#3     1    3
#4     2    3
#5     2    4
#6     2    5
#7     3    6
#8     3    7
#9     3    8
#10    4    6
#11    4    9
#12    4   10

Both columns are grouping variables: all rows with a 1 in grp1 are known to belong together, as are all rows with a 2, and so on. The same holds for grp2: all 1's belong together, all 2's belong together, etc.

Consider rows 3 and 4. From grp1 we know that the first three rows can be grouped together and that the second three rows can be grouped together. Since rows 3 and 4 share the same grp2 value, all six rows can in fact be grouped together.

By the same logic, the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).

Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.

The final output that I'm hoping to obtain would look something like:

# > df
#    grp1 grp2 combinedGrp
# 1     1    1           1
# 2     1    2           1
# 3     1    3           1
# 4     2    3           1
# 5     2    4           1
# 6     2    5           1
# 7     3    6           2
# 8     3    7           2
# 9     3    8           2
# 10    4    6           2
# 11    4    9           2
# 12    4   10           2

Thank you for any direction on this topic!

Andrew Taylor asked Dec 19 '22 17:12


1 Answer

I would define a graph and label nodes according to connected components:

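# Stack both columns into long form; each distinct (column, value) pair gets a node id
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))

# For each grouping column, add a node_* column with the node id of each row's value
oldcols = as.character(unique(gmap$ind))
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i)  with(gmap[gmap$ind == i, ],
  node[ match(df[[i]], values) ]
))

# Each row of df is an undirected edge between its grp1 node and its grp2 node
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
# Vertex k of g is node k of gmap, so membership lines up with gmap's rows
gmap$group = components(g)$membership

# Look up each row's group through its grp1 node
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]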


   grp1 grp2 node_grp1 node_grp2 group
1     1    1         1         5     1
2     1    2         1         6     1
3     1    3         1         7     1
4     2    3         2         7     1
5     2    4         2         8     1
6     2    5         2         9     1
7     3    6         3        10     2
8     3    7         3        11     2
9     3    8         3        12     2
10    4    6         4        10     2
11    4    9         4        13     2
12    4   10         4        14     2

Each unique element of grp1 or grp2 is a node and each row of df is an edge.
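
To make the node/edge idea concrete without any package, here is a minimal base R sketch of the same connected-components labelling using a simple union-find. It is an illustration rather than what igraph does internally; the names a, b, nodes, parent, find_root and the "grp1_"/"grp2_" prefixes are made up for this example, and the combinedGrp column name is taken from the question.

```r
# Example data from the question
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

# One node per distinct label; the prefixes (an assumption of this sketch)
# keep grp1 == 1 and grp2 == 1 apart
a <- paste0("grp1_", df$grp1)
b <- paste0("grp2_", df$grp2)
nodes <- unique(c(a, b))

# parent[i] is the index of node i's parent; every node starts as its own root
parent <- seq_along(nodes)

# Follow parent pointers up to the root of node i (no path compression)
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Each row is an edge: merge the two components it connects
ia <- match(a, nodes)
ib <- match(b, nodes)
for (k in seq_len(nrow(df))) {
  ra <- find_root(ia[k])
  rb <- find_root(ib[k])
  if (ra != rb) parent[rb] <- ra
}

# Rows whose grp1 node ends up under the same root get the same group number
roots <- vapply(ia, find_root, integer(1))
df$combinedGrp <- match(roots, unique(roots))
```

For the example data this puts rows 1-6 in one group and rows 7-12 in another. For large inputs, components() from igraph is likely the better choice, since this loop is plain R and does no path compression.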

Frank answered May 06 '23 16:05