I want an efficient method of determining the entire hierarchy-type relationship for a table of number pairs, and then expressing that relationship as a vector or string, so that I can determine other useful information about each pair's hierarchy, such as the highest related integer, the lowest related integer, and the total number of related integers.
For example, I have a table of integer pairs:
X    Y
---  ---
5    10
5    11
11   12
11   13
13   3
20   18
17   18
50   18
20   21
A record is related to another record if the two pairs share a value, either directly or through a chain of intermediate pairs. The final table would look something like this:
X    Y    Related IDs
---  ---  ----------------
5    10   3,5,10,11,12,13
5    11   3,5,10,11,12,13
11   12   3,5,10,11,12,13
11   13   3,5,10,11,12,13
13   3    3,5,10,11,12,13
20   18   17,18,20,21,50
17   18   17,18,20,21,50
50   18   17,18,20,21,50
20   21   17,18,20,21,50
What I have now is admittedly a mess. It uses a fuzzy_join with a matching function that takes x, y as a vector and matches between them. Each match creates a larger vector of all four matching numbers, which goes back into the fuzzy_join to match again, looping until there are no more matches. Performance degrades very quickly, and at about 4k records it just stops responding. The entire initial table of pairs will stay under 100k records.
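For reference, here is the sample table as an R data frame (a minimal sketch; the column names X and Y are assumed from the table above):
df <- data.frame(
  X = c(5, 5, 11, 11, 13, 20, 17, 50, 20),
  Y = c(10, 11, 12, 13, 3, 18, 18, 18, 21)
)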
In base R you could do:
relation <- function(dat){
  .relation <- function(x){
    # one expansion step: partners of x from either column, plus x itself
    k <- unique(sort(c(dat[dat[, 1] %in% x, 2], x, dat[dat[, 2] %in% x, 1])))
    # recurse until the set stops growing, then collapse it to a string
    if(setequal(x, k)) toString(k) else .relation(k)
  }
  sapply(dat[, 1], .relation)
}
df$related <- relation(df)
df
  X  Y               related
1  5 10 3, 5, 10, 11, 12, 13
2  5 11 3, 5, 10, 11, 12, 13
3 11 12 3, 5, 10, 11, 12, 13
4 11 13 3, 5, 10, 11, 12, 13
5 13  3 3, 5, 10, 11, 12, 13
6 20 18    17, 18, 20, 21, 50
7 17 18    17, 18, 20, 21, 50
8 50 18    17, 18, 20, 21, 50
9 20 21    17, 18, 20, 21, 50
If you have igraph installed, you could do:
library(igraph)
# connected-component id for every vertex; vertex names are the integers
a <- components(graph_from_data_frame(df, FALSE))$membership
# collapse the vertex names of each component into one comma-separated string
b <- tapply(names(a), a, toString)
# look up each row's group string via its X value
df$related <- b[a[as.character(df$X)]]
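With the sample data above, the two intermediate objects look roughly like this (component numbering and vertex order follow igraph's order of first appearance, so exact ordering may vary):
a
 5 11 13 20 17 50 10 12  3 18 21
 1  1  1  2  2  2  1  1  1  2  2
b
                     1                      2
"5, 11, 13, 10, 12, 3"   "20, 17, 50, 18, 21"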
EDIT:
If we are comparing the speed of the functions, note that in my function above the last statement, i.e. sapply(dat[,1], ...), recomputes the group for every element even when that group has already been computed, e.g. sapply(c(5,5,5,5), ...) will compute the group 4 times instead of just once. Now use:
relation <- function(dat){
  .relation <- function(x){
    k <- unique(c(dat[dat[, 1] %in% x, 2], x, dat[dat[, 2] %in% x, 1]))
    if(setequal(x, k)) sort(k) else .relation(k)
  }
  d <- unique(dat[, 1])
  # m caches the group string for every value whose group is already known
  m <- setNames(character(length(d)), d)
  while(length(d) > 0){
    s <- .relation(d[1])               # expand one unresolved value to its full group
    m[as.character(s)] <- toString(s)  # record the group once for every member
    d <- d[!d %in% s]                  # drop all values now resolved
  }
  dat$groups <- m[as.character(dat[, 1])]
  dat
}
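A quick check on the sample data (output abbreviated; rows 1 and 6 represent the two groups):
relation(df)$groups[c(1, 6)]
[1] "3, 5, 10, 11, 12, 13" "17, 18, 20, 21, 50"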
Now do the benchmark:
df1 <- do.call(rbind,rep(list(df),100))
# group_pairs is the competing function this is benchmarked against (defined in another answer)
microbenchmark::microbenchmark(relation(df1), group_pairs(df1))
Unit: milliseconds
             expr      min        lq       mean    median       uq      max neval
    relation(df1)   1.0909   1.17175   1.499096   1.27145   1.6580   3.2062   100
 group_pairs(df1) 153.3965 173.54265 199.559206 190.62030 213.7964 424.8309   100
Another option with igraph:
library(igraph)
# component membership for each vertex (clusters() is an alias of components())
clt <- clusters(graph_from_data_frame(df, directed = FALSE))$membership
# build one group string per component, then match it back to each row via X
within(df, ID <- ave(names(clt), clt, FUN = toString)[match(as.character(X), names(clt))])
such that
   X  Y                   ID
1  5 10 5, 11, 13, 10, 12, 3
2  5 11 5, 11, 13, 10, 12, 3
3 11 12 5, 11, 13, 10, 12, 3
4 11 13 5, 11, 13, 10, 12, 3
5 13  3 5, 11, 13, 10, 12, 3
6 20 18   20, 17, 50, 18, 21
7 17 18   20, 17, 50, 18, 21
8 50 18   20, 17, 50, 18, 21
9 20 21   20, 17, 50, 18, 21
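Once each row carries its group string, the summaries asked for in the question (highest related integer, lowest related integer, and total count) fall out directly. A minimal sketch using the related column produced by the first approach (the ID column above works the same way; the new column names are just illustrative):
ids <- strsplit(df$related, ", ")          # split each group string back into its IDs
df$highest   <- sapply(ids, function(v) max(as.numeric(v)))
df$lowest    <- sapply(ids, function(v) min(as.numeric(v)))
df$n_related <- lengths(ids)               # total number of related integers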