How to efficiently link multiple rows

Question

I have a dataset which includes two key columns: id_1 and id_2. id_1 is unique while id_2 is not unique. id_2 is a string that contains ids separated by -. For instance:

id_1	id_2
1	A-B-C
2	B-D
3	D-E
4	B
5	F
6	G

I want to create a new column, id_3, that assigns one identifier to each group of linked rows: rows whose id_2 values share at least one id, either directly or through a chain of other rows, should get the same id_3. For the example, I would like output like this:

id_1	id_2	id_3
1	A-B-C	A
2	B-D	A
3	D-E	A
4	B	A
5	F	B
6	G	C

I tried an inefficient approach to process the data using a for loop, but it doesn’t scale well. My dataset contains over 10M rows. It would be great if someone can share some thoughts. Really appreciate it.

Maël · Accepted Answer

As mentioned by @Gregor Thomas, you need to think of the ids in id_2 as nodes of a graph, connected whenever they appear in the same row (the same id_1). The connected components (clusters) of that graph then correspond to your id_3:

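# Load the packages used below (%>% comes with dplyr and is needed for the `.` placeholder)
library(dplyr)
library(tidyr)
library(igraph)

# Example data from the question
dat <- data.frame(id_1 = 1:6, id_2 = c("A-B-C", "B-D", "D-E", "B", "F", "G"))

# Build an undirected graph linking ids that share a row, then get its components
cl <- 
  dat |> 
  separate_longer_delim(id_2, "-") %>% 
  inner_join(., ., by = "id_1", relationship = "many-to-many") |> 
  select(from = id_2.x, to = id_2.y, id_1) |> 
  graph_from_data_frame(directed = FALSE) |> 
  components() |> 
  getElement("membership")

# Assign the cluster number to the original data, looked up by the first id in id_2
# (sub() keeps everything before the first "-", so ids longer than one character also work)
dat |> 
  mutate(id_3 = unname(cl[match(sub("-.*", "", id_2), names(cl))]))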

#   id_1  id_2 id_3
# 1    1 A-B-C    1
# 2    2   B-D    1
# 3    3   D-E    1
# 4    4     B    1
# 5    5     F    2
# 6    6     G    3

This is what your graph looks like: [image not available]
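Since the question is also tagged python, here is a minimal sketch of the same connected-components idea using a union-find (disjoint-set) structure from the standard library only. It is not part of the answer above and has not been benchmarked on 10M rows; the function names `find` and `assign_groups` are made up for illustration, and the input is assumed to be a plain list of id_2 strings.

```python
from typing import Dict, List


def find(parent: Dict[str, str], x: str) -> str:
    """Return the root of x, compressing the path along the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second walk: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def assign_groups(id_2_values: List[str], sep: str = "-") -> List[int]:
    """Return one group number per row; linked rows share a number.

    Hypothetical helper: groups are numbered 1, 2, 3, ... in order of
    first appearance, mirroring the id_3 column in the R output.
    """
    parent: Dict[str, str] = {}

    # Pass 1: merge every id in a row with the first id of that row.
    for value in id_2_values:
        tokens = value.split(sep)
        for token in tokens:
            parent.setdefault(token, token)
        first_root = find(parent, tokens[0])
        for token in tokens[1:]:
            root = find(parent, token)
            if root != first_root:
                parent[root] = first_root

    # Pass 2: label each row by the root of its first id.
    labels: Dict[str, int] = {}
    result: List[int] = []
    for value in id_2_values:
        root = find(parent, value.split(sep)[0])
        result.append(labels.setdefault(root, len(labels) + 1))
    return result


rows = ["A-B-C", "B-D", "D-E", "B", "F", "G"]
print(assign_groups(rows))  # [1, 1, 1, 1, 2, 3]
```

Labelling happens in a second pass because a later row can still merge two groups that looked separate earlier. The sketch does a single merge per extra id in a row rather than building all pairs within a row, which is the main reason it may scale better than a row-by-row loop that rescans the data.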

Tags: python, r, igraph

Jerry Zhang
