Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better fuzzy matching performance?

I'm currently using method get_close_matches method from difflib to iterate through a list of 15,000 strings to get the closest match against another list of approx 15,000 strings:

a=['blah','pie','apple'...]
b=['jimbo','zomg','pie'...]

for value in a:
    difflib.get_close_matches(value,b,n=1,cutoff=.85)

It takes .58 seconds per value which means it will take 8,714 seconds or 145 minutes to finish the loop. Is there another library/method that might be faster or a way to improve the speed for this method? I've already tried converting both arrays to lower case, but it only resulted in a slight speed increase.

like image 768
ChrisArmstrong Avatar asked Jan 28 '14 14:01

ChrisArmstrong


People also ask

How do you evaluate a fuzzy match?

One of the most effective ways to calculate scores for a fuzzy string matching algorithm is by using cosine similarity. The cosine similarity between two non-zero vectors is simply the cosine of the angle between these vectors.

Is fuzzy matching slow?

1) - Fuzzy-matching is slow. It is not advised to do this on a large document. 2) - Only text-matches in ~tildes~ will be fuzzy. #Tags and @methods will be unaffected.

How accurate is fuzzy matching?

Higher matching accuracy: fuzzy matching proves to be a far more accurate method of finding matching across two or more datasets. Unlike deterministic matching that determines matches on a 0 or 1 basis, fuzzy matching can detect variations that lie between 0 and 1 basis on a given matching threshold.

What is meant by fuzzy matching?

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.


1 Answers

fuzzyset indexes strings by their bigrams and trigrams so it finds approximate matches in O(log(N)) vs O(N) for difflib. For my fuzzyset of 1M+ words and word-pairs it can compute the index in about 20 seconds and find the closest match in less than a 100 ms.

like image 128
hobs Avatar answered Oct 01 '22 07:10

hobs