Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I match fuzzy match strings from two datasets?

I've been working on a way to join two datasets based on a imperfect string, such as a name of a company. In the past I had to match two very dirty lists, one list had names and financial information, another list had names and address. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAYBE TYPOS AND INSERTIONS.

So far AGREP is the closest tool I've found that might work. I can use levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the smallest distance (the most similar).

However, I've been having trouble turning this command from a single value to apply it to an entire data frame. I've crudely used a for loop to repeat the AGREP function, but there's gotta be an easier way.

See the following code:

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1)) b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))  for (i in 1:6){     a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))     a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4)) } 
like image 386
A L Avatar asked Oct 16 '14 13:10

A L


People also ask

What is fuzzy data matching?

What is Fuzzy Matching? Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same.

What is difference between exact match and fuzzy match?

You can use the exact matching method for almost any field, including custom fields. The fuzzy matching methods look for strings that approximately match a pattern. Some fuzzy matching methods, such as Acronym and Name Variant, identify similarities using hard-coded dictionaries.

Does SQL match fuzzy?

You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches.


2 Answers

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.

As suggested by @C8H10N4O2, the stringdist method="jw" creates the best matches for your example.

As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::slice_min to get only the best match with minimum distance. (slice_min replaces the older top_n and if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1))


a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),                 price = c(10, 13, 2, 1, 15, 1)) b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),                 qty = c(9, 99, 10))  library(fuzzyjoin); library(dplyr);  stringdist_join(a, b,                  by = "name",                 mode = "left",                 ignore_case = FALSE,                  method = "jw",                  max_dist = 99,                  distance_col = "dist") %>%   group_by(name.x) %>%   slice_min(order_by = dist, n = 1)  #> # A tibble: 6 x 5 #> # Groups:   name.x [6] #>   name.x price     name.y   qty       dist #>   <fctr> <dbl>     <fctr> <dbl>      <dbl> #> 1 Ace Co    10    Ace Co.     9 0.04761905 #> 2  Bayes    13 Bayes Inc.    99 0.16666667 #> 3    asd     2       asdf    10 0.08333333 #> 4    Bcy     1 Bayes Inc.    99 0.37777778 #> 5   Baes    15 Bayes Inc.    99 0.20000000 #> 6   Bays     1 Bayes Inc.    99 0.20000000 
like image 156
Arthur Yip Avatar answered Sep 19 '22 04:09

Arthur Yip


The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist) d <- expand.grid(a$name,b$name) # Distance matrix in long form names(d) <- c("a_name","b_name") d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)  # Greedy assignment heuristic (Your favorite heuristic here) greedyAssign <- function(a,b,d){   x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,    # 1 for already assigned, -1 for unassigned and unassignable   while(any(x==0)){     min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs     a_sel <- a[d==min_d & x==0][1]      b_sel <- b[d==min_d & a == a_sel & x==0][1]      x[a==a_sel & b == b_sel] <- 1     x[x==0 & (a==a_sel|b==b_sel)] <- -1   }   cbind(a=a[x==1],b=b[x==1],d=d[x==1]) } data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist)) 

Produces the assignment:

       a          b       d 1 Ace Co    Ace Co. 0.04762 2  Bayes Bayes Inc. 0.16667 3    asd       asdf 0.08333 

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),]))) 

Produces the result:

   a_name     b_name    dist 1  Ace Co    Ace Co. 0.04762 11   Baes Bayes Inc. 0.20000 8   Bayes Bayes Inc. 0.16667 12   Bays Bayes Inc. 0.20000 10    Bcy Bayes Inc. 0.37778 15    asd       asdf 0.08333 

Edit: use method="jw" to produce desired results. See help("stringdist-package")

like image 45
C8H10N4O2 Avatar answered Sep 18 '22 04:09

C8H10N4O2