Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assigning groupid across shared common columns

Tags:

r

data.table

I have the following data.frame (in reality it's millions of records).

I'd like to be able to assign a group id when a phone number matches across any other record and any phone number position.

id <- c(1:5)
phone_1 <- c("444","", "333", "222", "")
phone_2 <- c("", "444", "111", "", "")
phone_3 <- c("222","", "", "", "111")
df <- data.frame(id, phone_1, phone_2, phone_3)

The ideal output would be the following:

print(df)
#   id phone_1 phone_2 phone_3  ID
# 1  1     444             222  1
# 2  2             444          1
# 3  3     333     111          2
# 4  4     222                  1
# 5  5                     111  2

Ideally, I would like to use a data.table method as speed is important.

Note that missing values ("" or NA) should be treated as incomparables (or all of the above would be in the same group).

like image 576
Khaynes Avatar asked Mar 04 '23 21:03

Khaynes


2 Answers

Here is a solution finding components of a graph:

library(tidyr)
library(dplyr)
library(igraph)

df$id <- paste0("id-", df$id)

graph <- 
  df %>%
  gather(dummy, phone, -id) %>% 
  select(-dummy) %>% 
  filter(!is.na(phone) & phone != "") %>% 
  as.matrix() %>% 
  graph_from_edgelist(directed = FALSE)

plot(graph)

enter image description here

df$ID <- components(graph)$membership[df$id]

df

#     id phone_1 phone_2 phone_3 ID
# 1 id-1     444             222  1
# 2 id-2             444          1
# 3 id-3     333     111          2
# 4 id-4     222                  1
# 5 id-5                     111  2
like image 129
Aurèle Avatar answered Mar 15 '23 23:03

Aurèle


Following @Aurèle solution, if you want to compute it with data.table:

library(data.table)
library(igraph)

dt <- data.table(id, phone_1, phone_2, phone_3)
graph <- graph_from_edgelist(as.matrix(melt(dt, "id", names(dt)[grepl("phone", names(dt))])
                                       [!is.na(value) & value != ""][, variable := NULL]),
                             directed = FALSE)

dt[, ID := components(graph)$membership[dt[, as.character(id)]]]

#    id phone_1 phone_2 phone_3 ID
# 1:  1     444             222  1
# 2:  2             444          1
# 3:  3     333     111          2
# 4:  4     222                  1
# 5:  5                     111  2
like image 34
ebeneditos Avatar answered Mar 16 '23 00:03

ebeneditos