I have two dataframes I want to join. They share two fields: group_id and person_name. I want to join exactly on group_id and fuzzy on person_name. How can I do this?
Constraints: a result row must match on group_id exactly and on person_name fuzzily, and it must appear in both the left and right frames.
Here is a small example:
a = data.frame(
group_id=c(1,2,2,3,3,3),
person_name=c('Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'),
eye_color=c('brown', 'green', 'blue', 'brown', 'green', 'blue')
)
b = data.frame(
group_id=c(2,2,2,3,3,3,3),
person_name=c('Alie', 'Bobo', 'Charles', 'Charlie', 'Davis', 'Eva', 'Zed' ),
hair_color=c('brown', 'brown', 'black', 'grey', 'brown', 'black', 'blond')
)
expected = data.frame(
group_id=c(2,2,3,3),
person_name_x=c('Bob', 'Charlie', 'David', 'Eve'),
person_name_y=c('Bobo', 'Charles', 'Davis', 'Eva'),
eye_color=c('green', 'blue', 'brown', 'green'),
hair_color=c('brown', 'black', 'brown', 'black')
)
You could try
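library(RecordLinkage)
library(tidyverse)  # used here for the %>% pipe
# blockfld = 1: only compare rows with identical group_id (column 1)
# strcmp = 2: compare person_name (column 2) with a string similarity measure
# exclude = 3: leave the third column out of the comparison
compare.linkage(a, b, strcmp = 2, exclude = 3, blockfld = 1) %>%
  epiWeights %>%                                     # weight each candidate pair
  epiClassify(.8) %>%                                # 0.8 is the link threshold
  getPairs(show = "links", single.rows = TRUE) %>%
  .[c(2, 3, 7, 4, 8)]                                # keep the columns of interest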
# group_id.1 person_name.1 person_name.2 eye_color.1 hair_color.2
# 3 2 Charlie Charles blue black
# 2 2 Bob Bobo green brown
# 4 3 David Davis brown brown
# 5 3 Eve Eva green black
In this example, we basically need a hybrid join: an exact match on the values of one column (group_id) and a fuzzy match on the other (person_name).
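To make the hybrid idea concrete, here is a minimal base R sketch: join exactly on group_id first, then keep only the candidate pairs whose names are close. The names candidates and max_dist, and the threshold of 2 edits, are assumptions chosen for illustration, not part of the question.

```r
a <- data.frame(
  group_id = c(1, 2, 2, 3, 3, 3),
  person_name = c('Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'),
  eye_color = c('brown', 'green', 'blue', 'brown', 'green', 'blue'),
  stringsAsFactors = FALSE
)
b <- data.frame(
  group_id = c(2, 2, 2, 3, 3, 3, 3),
  person_name = c('Alie', 'Bobo', 'Charles', 'Charlie', 'Davis', 'Eva', 'Zed'),
  hair_color = c('brown', 'brown', 'black', 'grey', 'brown', 'black', 'blond'),
  stringsAsFactors = FALSE
)

# Exact part: every pair of rows sharing a group_id becomes a candidate
candidates <- merge(a, b, by = "group_id", suffixes = c("_x", "_y"))

# Fuzzy part: Levenshtein distance per candidate pair (adist() ships with base R)
d <- mapply(function(x, y) adist(x, y)[1, 1],
            candidates$person_name_x, candidates$person_name_y,
            USE.NAMES = FALSE)

max_dist <- 2  # assumed threshold; tune for your data
result <- candidates[d <= max_dist,
                     c("group_id", "person_name_x", "person_name_y",
                       "eye_color", "hair_color")]
rownames(result) <- NULL
result
```

With the example data this keeps the pairs Bob/Bobo, Charlie/Charles, David/Davis and Eve/Eva. The cross product inside each group can get large for big groups, so this is best suited to small or moderately sized data.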
One way to do this with the fuzzyjoin package:
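library(fuzzyjoin)
# Keep only rows whose group_id occurs in both data frames
common_id <- intersect(a$group_id, b$group_id)
res <- stringdist_inner_join(a[a$group_id %in% common_id, ], b[b$group_id %in% common_id, ],
                             by = "person_name")
# Drop fuzzy matches that pair rows from different groups
res[res$group_id.x == res$group_id.y, ]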
# group_id.x person_name.x eye_color group_id.y person_name.y hair_color
# <dbl> <fctr> <fctr> <dbl> <fctr> <fctr>
#1 2 Bob green 2 Bobo Brown
#2 2 Charlie blue 2 Charles Black
#3 3 David brown 3 Davis Brown
#4 3 Eve green 3 Eva Black
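Here, we first use intersect to find the group_id values present in both data frames and filter a and b down to those rows, then apply stringdist_inner_join to the person_name column only. Because the fuzzy join itself ignores group_id, a name in one group can still be paired with a similar name in a different group (for example, Charlie in group 2 of a and Charlie in group 3 of b), so keep only the rows where group_id.x equals group_id.y. The extra group_id column can then be dropped.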
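As an alternative sketch, fuzzyjoin can also apply a different match function to each join column in a single call, which avoids the separate group_id filtering. The helper name name_close and the distance threshold of 2 are assumptions for illustration; the stringdist package is used for the distance and is installed alongside fuzzyjoin.

```r
library(fuzzyjoin)

a <- data.frame(
  group_id = c(1, 2, 2, 3, 3, 3),
  person_name = c('Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'),
  eye_color = c('brown', 'green', 'blue', 'brown', 'green', 'blue'),
  stringsAsFactors = FALSE
)
b <- data.frame(
  group_id = c(2, 2, 2, 3, 3, 3, 3),
  person_name = c('Alie', 'Bobo', 'Charles', 'Charlie', 'Davis', 'Eva', 'Zed'),
  hair_color = c('brown', 'brown', 'black', 'grey', 'brown', 'black', 'blond'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where two names are within 2 edits (assumed threshold)
name_close <- function(x, y) {
  stringdist::stringdist(x, y, method = "lv") <= 2
}

# One match function per column in `by`:
# group_id must be equal, person_name only needs to be close
joined <- fuzzy_inner_join(
  a, b,
  by = c("group_id", "person_name"),
  match_fun = list(`==`, name_close)
)
joined
```

The result carries both copies of each join column (group_id.x, group_id.y, person_name.x, person_name.y), so one of the group_id columns can be dropped afterwards.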