I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values of these), I would like to do fuzzy matching on Zipcode, but only for those cases which have exact same StrCity value. I have started with first selecting only addresses that are matching the StrCity from dictionary and then fuzzy matching, but there are two problems:
1) if matched by Zipcode, it doesn't take into account the street and city 2) if matched by Address (containing all of Zipcode, Street and City), it returns also possible values, where on the same zipcode there is another street that is close enough in terms of distance.
Probably I need something like doing two different matches at the same time (one fuzzy and one exact), but I am not sure how to implement it, while not killing my computer performance-wise.
Here is data sample of TableAd:
StrCity ID Zipcode Street City Address
BiałowiejskaWarszawa 5148676 01-459 Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa 6423687 02-730 Bukowińska Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa 6425093 03-536 Kanałowa Warszawa 03-536KanałowaWarszawa
And the dictionary sample:
Zipcode Street City Address StrCity
02-882 Agaty Warszawa 02-882AgatyWarszawa AgatyWarszawa
03-663 Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536 Kołowa Warszawa 03-536KołowaWarszawa KołowaWarszawa
Here is my current code:
TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"
ResultTable <- TableMatch %>%
stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
select(ID, Zipcode.x, Address.x, Address.y, dist) %>%
group_by(Address.x) %>%
# select best fit record
top_n(-1, dist)
The problem I found specifically with an example provided above - the script verifies that strCity KanałowaWarszawa is present in dictionary, but Levenshtein distance of combined Address string is the same when changing the zipcode as when changing the street to Kołowa, which has the same zipcode as the one inspected. Here it returns both changes, but if there would be just 2 or 1 digits difference in zipcode, then it might incorrectly suggest replacing the street while zipcode should be changed.
Note: I am using packages purrr
, dplyr
and fuzzyjoin
.
Here is a way to make it work, using regular fuzzyjoin
functions, that are more flexible :
data
TableAd <- read.table(h=T,strin=F,text="StrCity ID Zipcode Street City Address
BiałowiejskaWarszawa 5148676 01-459 Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa 6423687 02-730 Bukowińska Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa 6425093 03-536 Kanałowa Warszawa 03-536KanałowaWarszawa")
TableDict <- read.table(h=T,strin=F,text="Zipcode Street City StrCity
02-882 Agaty Warszawa 02-882AgatyWarszawa AgatyWarszawa
03-663 Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536 Kołowa Warszawa 03-536KołowaWarszawa KołowaWarszawa")
solution
library(fuzzyjoin)
library(stringdist)
res <- fuzzy_left_join(
TableAd,
TableDict,
by=c("StrCity","Zipcode"),
list(`==`, function(x,y) stringdist(tolower(x), tolower(y), method="lv") <= 5)
)
res %>%
select(StrCity = StrCity.x, everything(), - StrCity.y)
# StrCity ID Zipcode.x Street.x City.x Address.x Zipcode.y Street.y City.y Address.y
# 1 BialowiejskaWarszawa 5148676 01-459 Bialowiejska Warszawa 01-459BialowiejskaWarszawa <NA> <NA> <NA> <NA>
# 2 BukowinskaWarszawa 6423687 02-730 Bukowinska Warszawa 02-730BukowinskaWarszawa <NA> <NA> <NA> <NA>
# 3 KanalowaWarszawa 6425093 03-536 Kanalowa Warszawa 03-536KanalowaWarszawa 03-663 Kanalowa Warszawa 03-663KanalowaWarszawa
The problem of the above solution is that it makes a cartesian product internally, which might be problematic if you have a lot of data. The impact is diminished by the fact you're joining on concatenated strings but it feels like a hack that would be better avoided.
A way to sort this out would be to apply the fuzzy join on pairs of subsets determined by the exact matches, we define a function below to do that, along with enhanced sample data.
data
TableAd2 <- read.table(h=T,strin=F,text="ID Zipcode Street City
5148676 01-459 Białowiejska Warszawa
6423687 02-730 Bukowińska Warszawa
6423687 99-999 Agaty Warszawa
6423687 02-883 Agaty Warszawa
6425093 03-536 Kanałowa Warszawa")
TableDict2 <- read.table(h=T,strin=F,text="Zipcode Street City
02-882 Agaty Warszawa
03-663 Kanałowa Warszawa
03-536 Kołowa Warszawa
02-730 Bukowińska Warszawa")
function
fuzzy_inner_join2 <- function(x,y,by, match_fun, ...){
match_fun_equal_lgl <- sapply(match_fun, identical, `==`)
# columns to use for exact join equivalent
by_exact = by[match_fun_equal_lgl]
# columns to use for fuzzy join on relevant subsets of data (for efficiency)
by_fuzzy = by[!match_fun_equal_lgl]
# update match_fun
match_fun <- match_fun[!match_fun_equal_lgl]
# trim inputs of irrelevant data
x <- dplyr::semi_join(x,y,by= by_exact)
y <- dplyr::semi_join(y,x,by= by_exact)
# make lists so we have pairs of data frames to fuzzy join together
x_list <- dplyr::group_split(dplyr::group_by_at(x, by_exact))
y_list <- dplyr::group_split(dplyr::group_by_at(y, by_exact), keep = FALSE)
# apply fuzzy join on pairs and bind the results
map2_dfr(x_list,y_list, fuzzyjoin::fuzzy_inner_join, match_fun = match_fun,
by = by_fuzzy, ...)
}
solution
fuzzy_inner_join2(
TableAd2,
TableDict2,
by=c("City","Street","Zipcode"),
match_fun = list(
`==`, `==`,
function(x,y) stringdist(tolower(x), tolower(y), method="lv") <= 3)
)
# # A tibble: 3 x 5
# ID Zipcode.x Street City Zipcode.y
# <int> <chr> <chr> <chr> <chr>
# 1 6423687 02-883 Agaty Warszawa 02-882
# 2 6423687 02-730 Bukowinska Warszawa 02-730
# 3 6425093 03-536 Kanalowa Warszawa 03-663
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With