I have the following data.frame (in reality it's millions of records).
I'd like to be able to assign a group id when a phone number matches across any other record and any phone number position.
id <- c(1:5)
phone_1 <- c("444","", "333", "222", "")
phone_2 <- c("", "444", "111", "", "")
phone_3 <- c("222","", "", "", "111")
df <- data.frame(id, phone_1, phone_2, phone_3)
The ideal output would be the following:
print(df)
# id phone_1 phone_2 phone_3 ID
# 1 1 444 222 1
# 2 2 444 1
# 3 3 333 111 2
# 4 4 222 1
# 5 5 111 2
Ideally, I would like to use a data.table
method as speed is important.
Note that missing values (""
or NA
) should be treated as incomparables (or all of the above would be in the same group).
Here is a solution finding components of a graph:
library(tidyr)
library(dplyr)
library(igraph)
df$id <- paste0("id-", df$id)
graph <-
df %>%
gather(dummy, phone, -id) %>%
select(-dummy) %>%
filter(!is.na(phone) & phone != "") %>%
as.matrix() %>%
graph_from_edgelist(directed = FALSE)
plot(graph)
df$ID <- components(graph)$membership[df$id]
df
# id phone_1 phone_2 phone_3 ID
# 1 id-1 444 222 1
# 2 id-2 444 1
# 3 id-3 333 111 2
# 4 id-4 222 1
# 5 id-5 111 2
Following @Aurèle
solution, if you want to compute it with data.table
:
library(data.table)
library(igraph)
dt <- data.table(id, phone_1, phone_2, phone_3)
graph <- graph_from_edgelist(as.matrix(melt(dt, "id", names(dt)[grepl("phone", names(dt))])
[!is.na(value) & value != ""][, variable := NULL]),
directed = FALSE)
dt[, ID := components(graph)$membership[dt[, as.character(id)]]]
# id phone_1 phone_2 phone_3 ID
# 1: 1 444 222 1
# 2: 2 444 1
# 3: 3 333 111 2
# 4: 4 222 1
# 5: 5 111 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With