I have a dataset with two key columns, id_1 and id_2. id_1 is unique, while id_2 is not. Each id_2 value is a string of ids separated by `-`. For instance:
| id_1 | id_2 |
|---|---|
| 1 | A-B-C |
| 2 | B-D |
| 3 | D-E |
| 4 | B |
| 5 | F |
| 6 | G |
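What I want to achieve is to create a new id_3 that groups rows whose id_2 values are linked: rows that share any id, directly or through a chain of other rows, should get the same id_3. For example, row 3 (D-E) shares no id with row 1 (A-B-C), but both are linked through row 2 (B-D). So for the example I would like output like this: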
| id_1 | id_2 | id_3 |
|---|---|---|
| 1 | A-B-C | A |
| 2 | B-D | A |
| 3 | D-E | A |
| 4 | B | A |
| 5 | F | B |
| 6 | G | C |
I tried processing the data with a for loop, but that is inefficient and does not scale: my dataset contains over 10M rows. Any thoughts would be much appreciated.
As mentioned by @Gregor Thomas, you need to think of the ids in id_2 as nodes of a graph, where ids that appear in the same row (the same id_1) are connected to each other. The connected components of that graph then correspond to your id_3:
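library(dplyr)
library(tidyr)
library(igraph)

# Split id_2 into one id per row, pair up the ids that share an id_1,
# convert the pairs to an `igraph` object and get the clusters.
# The self-join uses the `.` placeholder, so that step needs %>% rather than |>.
cl <-
  dat |>
  separate_longer_delim(id_2, "-") %>%
  inner_join(., ., by = "id_1", relationship = "many-to-many") |>
  select(from = id_2.x, to = id_2.y, id_1) |>
  graph_from_data_frame(directed = FALSE) |>
  components() |>
  getElement("membership")

# Assign the cluster number to the original data, looking each row up by its
# first id (sub() keeps everything before the first "-", so ids longer than
# one character also work)
dat |>
  mutate(id_3 = cl[match(sub("-.*", "", id_2), names(cl))])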
#   id_1  id_2 id_3
# 1    1 A-B-C    1
# 2    2   B-D    1
# 3    3   D-E    1
# 4    4     B    1
# 5    5     F    2
# 6    6     G    3
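If you want to check the grouping logic without igraph, here is a minimal base R sketch of the same idea using a union-find structure. `component_ids` is a hypothetical helper name, not a function from any package, and the sketch assumes every id_2 is a non-empty string. Instead of pairing up all ids within a row, it links each id to the first id of its row, which yields the same components.

```r
# Toy data from the question
dat <- data.frame(
  id_1 = 1:6,
  id_2 = c("A-B-C", "B-D", "D-E", "B", "F", "G")
)

# Hypothetical helper (not from a package): label each row with the
# connected component of its ids, using union-find.
# Assumes every element of id_2 is a non-empty string.
component_ids <- function(id_2, sep = "-") {
  tokens <- strsplit(id_2, sep, fixed = TRUE)
  nodes  <- unique(unlist(tokens))
  parent <- seq_along(nodes)   # every node starts as its own root

  # Follow parent pointers to the root, then compress the path
  find_root <- function(i) {
    root <- i
    while (parent[root] != root) root <- parent[root]
    while (parent[i] != root) {
      nxt <- parent[i]
      parent[i] <<- root
      i <- nxt
    }
    root
  }

  # Merge every id in a row into the component of the row's first id
  for (tk in tokens) {
    idx <- match(tk, nodes)
    r1 <- find_root(idx[1])
    for (j in idx[-1]) {
      r2 <- find_root(j)
      if (r1 != r2) parent[r2] <- r1
    }
  }

  # A row's component is the root of its first id; relabel the roots
  # as 1, 2, 3, ... in order of first appearance
  first <- vapply(tokens, function(tk) tk[1], character(1))
  roots <- vapply(match(first, nodes), find_root, integer(1))
  match(roots, unique(roots))
}

dat$id_3 <- component_ids(dat$id_2)
dat$id_3
# 1 1 1 1 2 3
```

On 10M rows the igraph version above is probably the better choice, since its component search runs in compiled code while this loop runs in plain R. The sketch is mainly useful for verifying results on a sample.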
This is what your graph looks like: A, B, C, D and E form one connected component, while F and G are each on their own.
