Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fuzzy string matching with term weights

I'm working on an application that attempts to match an input set of potentially "messy" entity names to "clean" entity names in a reference list. I've been working with edit distance and other common fuzzy matching algorithms, but I'm wondering if there are any better approaches that allow for term weighting, such that common terms are given less weight in the fuzzy match.

Consider this example, using Python's difflib library. I'm working with organization names, which have many standardized components in common and therefore cannot be used to differentiate among entities.

from difflib import SequenceMatcher  
e1a = SequenceMatcher(None, "ZOECON RESEARCH INSTITUTE", 
                            "LONDON RESEARCH INSTITUTE")
print e1a.ratio()
0.88

e1b = SequenceMatcher(None, "ZOECON", "LONDON")
print e1b.ratio() 
0.333333333333

e2a = SequenceMatcher(None, "WORLDWIDE SEMICONDUCTOR MANUFACTURING CORP",
                            "TAIWAN SEMICONDUCTOR MANUFACTURING CORP")
print e2a.ratio() 
0.83950617284

e2b = SequenceMatcher(None, "WORLDWIDE",
                            "TAIWAN")
print e2b.ratio() 
0.133333333333

Both examples score highly on the full string because RESEARCH, INSTITUTE, SEMICONDUCTOR, MANUFACTURING, and CORP are high frequency, generic terms in many organization names. I'm looking for any ideas of how to integrate term frequencies into fuzzy string matching (not necessarily using difflib), such that the scores are't as influenced by common terms, and the results might look more like the "e1b" and "e2b" examples.

I realize I could just make a big "frequent term" list and exclude those from the comparison, but I'd like to use frequencies if possible because even common words add some information, and also the cutoff point for any list would of course also be arbitrary.

like image 609
rjf Avatar asked Oct 06 '12 16:10

rjf


People also ask

What is fuzzy data matching?

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.

How do you match fuzzy strings in R?

Often you may want to join together two datasets in R based on imperfectly matching strings. This is sometimes called fuzzy matching. The easiest way to perform fuzzy matching in R is to use the stringdist_join() function from the fuzzyjoin package.

Can you do fuzzy matching in Excel?

The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables.


2 Answers

Here's a weird idea for you:

Compress your input and diff that.

You could use e.g. Huffman or dictionary coder to compress your input, that automatically takes care of common terms. It may not do so well for typos though, in your example, London is probably a relatively common word, while misspelt Lundon is not at all, and dissimilarity between compressed terms is much higher than between raw terms.

like image 63
Dima Tisnek Avatar answered Sep 25 '22 11:09

Dima Tisnek


how about splitting each string into a list of words, and running your comparison on each word to get a list which holds the scores of word matches. then you can average the scores, find the lowest/highest indirect match or partials...

gives you the ability to add your own weight.

you would of course need to handle offsets like..

"the london company for leather"

and

"london company for leather"

like image 23
Inbar Rose Avatar answered Sep 24 '22 11:09

Inbar Rose