
How to efficiently link multiple rows

Tags: python, r, igraph

I have a dataset which includes two key columns: id_1 and id_2. id_1 is unique while id_2 is not unique. id_2 is a string that contains ids separated by -. For instance:

id_1 id_2
1 A-B-C
2 B-D
3 D-E
4 B
5 F
6 G

What I want to achieve is to create a new id_3 that assigns a unique identifier to id_2, ensuring that any previously linked values share the same id_3. So for the example I would like to have output like this:

id_1 id_2 id_3
1 A-B-C A
2 B-D A
3 D-E A
4 B A
5 F B
6 G C

I tried an inefficient approach to process the data using a for loop, but it doesn’t scale well. My dataset contains over 10M rows. It would be great if someone can share some thoughts. Really appreciate it.

Jerry Zhang asked Apr 19 '26 16:04


1 Answer

As mentioned by @Gregor Thomas, you need to think of the ids in id_2 as nodes of a graph, where ids that appear in the same id_1 row are connected. The connected components (clusters) of that graph then correspond to your id_3:

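# Load the packages used below
library(dplyr)
library(tidyr)
library(igraph)

# Convert to an `igraph` object and get the cluster membership of each id
cl <- 
  dat |> 
  separate_longer_delim(id_2, "-") %>% 
  inner_join(., ., by = "id_1", relationship = "many-to-many") |> 
  select(from = id_2.x, to = id_2.y, id_1) |> 
  graph_from_data_frame(directed = FALSE) |> 
  components() |> 
  getElement("membership")

# Assign the cluster number to the original data by looking up the first id of each row
# (sub() keeps everything before the first "-", so ids longer than one character also work)
dat |> 
  mutate(id_3 = cl[match(sub("-.*", "", id_2), names(cl))])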

#   id_1  id_2 id_3
# 1    1 A-B-C    1
# 2    2   B-D    1
# 3    3   D-E    1
# 4    4     B    1
# 5    5     F    2
# 6    6     G    3

In the resulting graph, A, B, C, D and E form one connected component, while F and G are each a component of their own.
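
Since the question is also tagged python, here is a rough sketch of the same connected-components idea using a union-find (disjoint-set) structure instead of igraph. This is a swapped-in technique, not what the R code above does internally, and the function name `link_ids` and its `sep` parameter are made up for illustration. It assumes id_2 is available as a list of strings and that the ids within a row are separated by "-".

```python
def link_ids(id_2_values, sep="-"):
    """Return one group number per row; rows that share any id get the same number.

    Hypothetical helper for illustration: a plain union-find over the ids in id_2.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walked path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # First pass: link every id in a row to the first id of that row.
    firsts = []
    for value in id_2_values:
        parts = value.split(sep)
        firsts.append(parts[0])
        find(parts[0])
        for other in parts[1:]:
            union(parts[0], other)

    # Second pass: number the groups in order of first appearance.
    labels = {}
    return [labels.setdefault(find(first), len(labels) + 1) for first in firsts]


id_3 = link_ids(["A-B-C", "B-D", "D-E", "B", "F", "G"])
print(id_3)  # [1, 1, 1, 1, 2, 3]
```

Each row is touched twice and no self-join is needed, so this should stay close to linear in the total number of ids, though I have not benchmarked it on 10M rows.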

Maël answered Apr 21 '26 05:04


