How can I match fuzzy match strings from two datasets?

Tags:

I've been working on a way to join two datasets based on a imperfect string, such as a name of a company. In the past I had to match two very dirty lists, one list had names and financial information, another list had names and address. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAYBE TYPOS AND INSERTIONS.

So far AGREP is the closest tool I've found that might work. I can use levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the smallest distance (the most similar).

However, I've been having trouble turning this command from a single value to apply it to an entire data frame. I've crudely used a for loop to repeat the AGREP function, but there's gotta be an easier way.

See the following code:

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1)) b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))  for (i in 1:6){     a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))     a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4)) }

386

asked Oct 16 '14 13:10

A L

2 Answers

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.

As suggested by @C8H10N4O2, the stringdist method="jw" creates the best matches for your example.

As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::slice_min to get only the best match with minimum distance. (slice_min replaces the older top_n and if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1))

a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),                 price = c(10, 13, 2, 1, 15, 1)) b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),                 qty = c(9, 99, 10))  library(fuzzyjoin); library(dplyr);  stringdist_join(a, b,                  by = "name",                 mode = "left",                 ignore_case = FALSE,                  method = "jw",                  max_dist = 99,                  distance_col = "dist") %>%   group_by(name.x) %>%   slice_min(order_by = dist, n = 1)  #> # A tibble: 6 x 5 #> # Groups:   name.x [6] #>   name.x price     name.y   qty       dist #>   <fctr> <dbl>     <fctr> <dbl>      <dbl> #> 1 Ace Co    10    Ace Co.     9 0.04761905 #> 2  Bayes    13 Bayes Inc.    99 0.16666667 #> 3    asd     2       asdf    10 0.08333333 #> 4    Bcy     1 Bayes Inc.    99 0.37777778 #> 5   Baes    15 Bayes Inc.    99 0.20000000 #> 6   Bays     1 Bayes Inc.    99 0.20000000

156

answered Sep 19 '22 04:09

Arthur Yip

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist) d <- expand.grid(a$name,b$name) # Distance matrix in long form names(d) <- c("a_name","b_name") d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)  # Greedy assignment heuristic (Your favorite heuristic here) greedyAssign <- function(a,b,d){   x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,    # 1 for already assigned, -1 for unassigned and unassignable   while(any(x==0)){     min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs     a_sel <- a[d==min_d & x==0][1]      b_sel <- b[d==min_d & a == a_sel & x==0][1]      x[a==a_sel & b == b_sel] <- 1     x[x==0 & (a==a_sel|b==b_sel)] <- -1   }   cbind(a=a[x==1],b=b[x==1],d=d[x==1]) } data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d 1 Ace Co    Ace Co. 0.04762 2  Bayes Bayes Inc. 0.16667 3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist 1  Ace Co    Ace Co. 0.04762 11   Baes Bayes Inc. 0.20000 8   Bayes Bayes Inc. 0.16667 12   Bays Bayes Inc. 0.20000 10    Bcy Bayes Inc. 0.37778 15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")

answered Sep 18 '22 04:09

C8H10N4O2

Related questions
                            
                                R: legend with points and lines being different colors (for the same legend item)
                            
                                NOTE in R CRAN Check: No repository set, so cyclic dependency check skipped
                            
                                There is pmin and pmax each taking na.rm, why no psum?
                            
                                Check if character string is a valid color representation
                            
                                What are the differences between concatenating strings with cat() and paste()?
                            
                                Implementation of standard recycling rules
                            
                                What constitutes a good package name according to CRAN? [closed]
                            
                                Why does the number 1e9999... (31 9s) cause problems in R?
                            
                                Is there a more efficient way to replace NULL with NA in a list?
                            
                                Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)
                            
                                How (and why) do you use contrasts?
                            
                                Dynamic height and width for knitr plots
                            
                                Dealing with Messy Dates
                            
                                How can I sort a data.frame with only one column, without losing rownames?
                            
                                Solving simultaneous equations with R
                            
                                Why is the terminology of labels and levels in factors so weird?
                            
                                caught segfault error in R
                            
                                What is the practical use of the identity function in R?
                            
                                Is there a way to use two '...' statements in a function in R?
                            
                                Aesthetics must either be length one, or the same length as the dataProblems

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can I match fuzzy match strings from two datasets?

Tags:

string-matching

r

fuzzy-comparison

fuzzy-search

A L

People also ask

2 Answers

Arthur Yip

C8H10N4O2

Recent Activity

Donate For Us