Finding one-to-one, one-to-many, and many-to-one relationships between columns

Tags: r, dplyr

Consider the following data frame:

  first_name last_name
1         Al     Smith
2         Al     Jones
3       Jeff  Thompson
4      Scott  Thompson
5      Terry    Dactil
6       Pete       Zah

data <- data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah"))

In this data frame, there are three ways that first_name is related to last_name:

  • One to one (i.e. a first_name appears with exactly one last_name, and that last_name appears with no other first_name)
  • One to many (i.e. one first_name points to multiple last_name values)
  • Many to one (i.e. multiple first_name values point to one last_name)

I want to be able to quickly identify each of the three cases and output them to a data frame. So, the resulting data frames would be:

One to one

  first_name last_name
1      Terry    Dactil
2       Pete       Zah

One to many

  first_name last_name
1         Al     Smith
2         Al     Jones

Many to one

  first_name last_name
1       Jeff  Thompson
2      Scott  Thompson

I would like to do this within the dplyr package.

glonn asked Feb 09 '23 22:02


1 Answer

In general, you can check if a value is duplicated using the duplicated function (as mentioned by @RichardScriven in a comment on your question). However, by default this function doesn't mark the first instance of an element that appears multiple times as duplicated:

duplicated(c(1, 1, 1, 2))
# [1] FALSE  TRUE  TRUE FALSE

Since you also want to pick up these cases, you generally would want to run duplicated on each vector twice, once forward and once backwards:

duplicated(c(1, 1, 1, 2)) | duplicated(c(1, 1, 1, 2), fromLast=TRUE)
# [1]  TRUE  TRUE  TRUE FALSE

I find this to be a lot of typing, so I'll define a helper function that checks if an element appears more than once:

d <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
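As a quick check of what the helper returns, here is a minimal sketch that applies d to each column of the example data frame. The data frame is rebuilt so the snippet runs on its own.

```r
# Helper from above: TRUE for every element that occurs more than once
d <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

data <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah")
)

# "Al" is the only repeated first name
d(data$first_name)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

# "Thompson" is the only repeated last name
d(data$last_name)
# [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
```

Combining these two logical vectors with & and ! is all the three cases need.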

Now each of the three cases is a simple one-liner:

# One to one
data[!d(data$first_name) & !d(data$last_name),]
#   first_name last_name
# 5      Terry    Dactil
# 6       Pete       Zah

# One to many
data[d(data$first_name) & !d(data$last_name),]
#   first_name last_name
# 1         Al     Smith
# 2         Al     Jones

# Many to one
data[!d(data$first_name) & d(data$last_name),]
#   first_name last_name
# 3       Jeff  Thompson
# 4      Scott  Thompson
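Since the question asks for dplyr, the same three filters could probably be written with add_count() and filter(). This is only a sketch, not a tested drop-in: the column names n_first and n_last are made up for the illustration, and it assumes the data contains no repeated (first_name, last_name) rows (otherwise you may want to call distinct() first).

```r
library(dplyr)

data <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah")
)

# Attach how often each first_name and each last_name occurs.
# n_first and n_last are hypothetical names chosen for this sketch.
counts <- data %>%
  add_count(first_name, name = "n_first") %>%
  add_count(last_name, name = "n_last")

# One to one: both values are unique in their column
one_to_one <- counts %>%
  filter(n_first == 1, n_last == 1) %>%
  select(first_name, last_name)

# One to many: the first_name repeats, the last_name does not
one_to_many <- counts %>%
  filter(n_first > 1, n_last == 1) %>%
  select(first_name, last_name)

# Many to one: the last_name repeats, the first_name does not
many_to_one <- counts %>%
  filter(n_first == 1, n_last > 1) %>%
  select(first_name, last_name)
```

The counts play the same role as the d helper: a count above 1 corresponds to d returning TRUE.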

Note that you could also define d without the help of duplicated using the table function:

# as.character() makes the lookup go by name, so it also works for numeric x
d <- function(x) table(x)[as.character(x)] > 1

While this alternate definition is slightly more succinct, I also find it less readable.
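To see that the two definitions agree, here is a small sketch comparing them. The names d_dup and d_tab are hypothetical, used only to keep both versions side by side. The as.character() and as.vector() calls are additions in this sketch: the first forces the table lookup to go by name rather than by position (which matters for numeric input), and the second drops the table's names so the result is a plain logical vector.

```r
# Version based on duplicated()
d_dup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Version based on table(); as.character() and as.vector() are
# additions made for this sketch (lookup by name, plain vector result)
d_tab <- function(x) as.vector(table(x)[as.character(x)] > 1)

x <- c(5, 5, 7)

d_dup(x)
# [1]  TRUE  TRUE FALSE
d_tab(x)
# [1]  TRUE  TRUE FALSE

# Both versions give the same answer on this input
identical(d_dup(x), d_tab(x))
# [1] TRUE
```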

josliber answered Feb 11 '23 22:02