Assigning groupid across shared common columns

Question

I have the following data.frame (in reality it's millions of records).

I'd like to be able to assign a group id when a phone number matches across any other record and any phone number position.

id <- c(1:5)
phone_1 <- c("444","", "333", "222", "")
phone_2 <- c("", "444", "111", "", "")
phone_3 <- c("222","", "", "", "111")
df <- data.frame(id, phone_1, phone_2, phone_3)

The ideal output would be the following:

print(df)
#   id phone_1 phone_2 phone_3  ID
# 1  1     444             222  1
# 2  2             444          1
# 3  3     333     111          2
# 4  4     222                  1
# 5  5                     111  2

Ideally, I would like to use a data.table method as speed is important.

Note that missing values ("" or NA) should be treated as incomparables (or all of the above would be in the same group).

Aurèle · Accepted Answer

Here is a solution finding components of a graph:

library(tidyr)
library(dplyr)
library(igraph)

df$id <- paste0("id-", df$id)

graph <- 
  df %>%
  gather(dummy, phone, -id) %>% 
  select(-dummy) %>% 
  filter(!is.na(phone) & phone != "") %>% 
  as.matrix() %>% 
  graph_from_edgelist(directed = FALSE)

plot(graph)

enter image description here

df$ID <- components(graph)$membership[df$id]

df

#     id phone_1 phone_2 phone_3 ID
# 1 id-1     444             222  1
# 2 id-2             444          1
# 3 id-3     333     111          2
# 4 id-4     222                  1
# 5 id-5                     111  2

ebeneditos · Answer

Following @Aurèle solution, if you want to compute it with data.table:

library(data.table)
library(igraph)

dt <- data.table(id, phone_1, phone_2, phone_3)
graph <- graph_from_edgelist(as.matrix(melt(dt, "id", names(dt)[grepl("phone", names(dt))])
                                       [!is.na(value) & value != ""][, variable := NULL]),
                             directed = FALSE)

dt[, ID := components(graph)$membership[dt[, as.character(id)]]]

#    id phone_1 phone_2 phone_3 ID
# 1:  1     444             222  1
# 2:  2             444          1
# 3:  3     333     111          2
# 4:  4     222                  1
# 5:  5                     111  2

Assigning groupid across shared common columns

Tags:

r

data.table

Khaynes

2 Answers

Aurèle

ebeneditos

Recent Activity

Donate For Us

Assigning groupid across shared common columns

Tags:

r

data.table

Khaynes

2 Answers

Aurèle

ebeneditos

Related questions

Recent Activity

Donate For Us