I have the following data.frame (in reality it's millions of records).
I'd like to be able to assign a group id when a phone number matches across any other record and any phone number position.
id <- c(1:5)
phone_1 <- c("444","", "333", "222", "")
phone_2 <- c("", "444", "111", "", "")
phone_3 <- c("222","", "", "", "111")
df <- data.frame(id, phone_1, phone_2, phone_3)
The ideal output would be the following:
print(df)
# id phone_1 phone_2 phone_3 ID
# 1 1 444 222 1
# 2 2 444 1
# 3 3 333 111 2
# 4 4 222 1
# 5 5 111 2
Ideally, I would like to use a data.table method as speed is important.
Note that missing values ("" or NA) should be treated as incomparables (or all of the above would be in the same group).
Here is a solution finding components of a graph:
library(tidyr)
library(dplyr)
library(igraph)
df$id <- paste0("id-", df$id)
graph <-
df %>%
gather(dummy, phone, -id) %>%
select(-dummy) %>%
filter(!is.na(phone) & phone != "") %>%
as.matrix() %>%
graph_from_edgelist(directed = FALSE)
plot(graph)

df$ID <- components(graph)$membership[df$id]
df
# id phone_1 phone_2 phone_3 ID
# 1 id-1 444 222 1
# 2 id-2 444 1
# 3 id-3 333 111 2
# 4 id-4 222 1
# 5 id-5 111 2
Following @Aurèle solution, if you want to compute it with data.table:
library(data.table)
library(igraph)
dt <- data.table(id, phone_1, phone_2, phone_3)
graph <- graph_from_edgelist(as.matrix(melt(dt, "id", names(dt)[grepl("phone", names(dt))])
[!is.na(value) & value != ""][, variable := NULL]),
directed = FALSE)
dt[, ID := components(graph)$membership[dt[, as.character(id)]]]
# id phone_1 phone_2 phone_3 ID
# 1: 1 444 222 1
# 2: 2 444 1
# 3: 3 333 111 2
# 4: 4 222 1
# 5: 5 111 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With