I've been working on a way to join two datasets based on a imperfect string, such as a name of a company. In the past I had to match two very dirty lists, one list had names and financial information, another list had names and address. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAYBE TYPOS AND INSERTIONS.
So far AGREP is the closest tool I've found that might work. I can use levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the smallest distance (the most similar).
However, I've been having trouble turning this command from a single value to apply it to an entire data frame. I've crudely used a for loop to repeat the AGREP function, but there's gotta be an easier way.
See the following code:
a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1)) b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10)) for (i in 1:6){ a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4)) a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4)) }
What is Fuzzy Matching? Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same.
You can use the exact matching method for almost any field, including custom fields. The fuzzy matching methods look for strings that approximately match a pattern. Some fuzzy matching methods, such as Acronym and Name Variant, identify similarities using hard-coded dictionaries.
You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches.
Here is a solution using the fuzzyjoin
package. It uses dplyr
-like syntax and stringdist
as one of the possible types of fuzzy matching.
As suggested by @C8H10N4O2, the stringdist
method="jw" creates the best matches for your example.
As suggested by @dgrtwo, the developer of fuzzyjoin
, I used a large max_dist
and then used dplyr::group_by
and dplyr::slice_min
to get only the best match with minimum distance. (slice_min
replaces the older top_n
and if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1)
)
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price = c(10, 13, 2, 1, 15, 1)) b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'), qty = c(9, 99, 10)) library(fuzzyjoin); library(dplyr); stringdist_join(a, b, by = "name", mode = "left", ignore_case = FALSE, method = "jw", max_dist = 99, distance_col = "dist") %>% group_by(name.x) %>% slice_min(order_by = dist, n = 1) #> # A tibble: 6 x 5 #> # Groups: name.x [6] #> name.x price name.y qty dist #> <fctr> <dbl> <fctr> <dbl> <dbl> #> 1 Ace Co 10 Ace Co. 9 0.04761905 #> 2 Bayes 13 Bayes Inc. 99 0.16666667 #> 3 asd 2 asdf 10 0.08333333 #> 4 Bcy 1 Bayes Inc. 99 0.37777778 #> 5 Baes 15 Bayes Inc. 99 0.20000000 #> 6 Bays 1 Bayes Inc. 99 0.20000000
The solution depends on the desired cardinality of your matching a
to b
. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim
.
Not familiar with AGREP but here's example using stringdist
for your distance matrix.
library(stringdist) d <- expand.grid(a$name,b$name) # Distance matrix in long form names(d) <- c("a_name","b_name") d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here) # Greedy assignment heuristic (Your favorite heuristic here) greedyAssign <- function(a,b,d){ x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable, # 1 for already assigned, -1 for unassigned and unassignable while(any(x==0)){ min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs a_sel <- a[d==min_d & x==0][1] b_sel <- b[d==min_d & a == a_sel & x==0][1] x[a==a_sel & b == b_sel] <- 1 x[x==0 & (a==a_sel|b==b_sel)] <- -1 } cbind(a=a[x==1],b=b[x==1],d=d[x==1]) } data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
Produces the assignment:
a b d 1 Ace Co Ace Co. 0.04762 2 Bayes Bayes Inc. 0.16667 3 asd asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
a_name b_name dist 1 Ace Co Ace Co. 0.04762 11 Baes Bayes Inc. 0.20000 8 Bayes Bayes Inc. 0.16667 12 Bays Bayes Inc. 0.20000 10 Bcy Bayes Inc. 0.37778 15 asd asdf 0.08333
Edit: use method="jw"
to produce desired results. See help("stringdist-package")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With