
Better Approach than FuzzyWuzzy?

I'm getting results from fuzzywuzzy that aren't working as well as I'd hoped. If there is an extra word in the middle, the score is lower because of the larger Levenshtein distance.

Example:

from fuzzywuzzy import fuzz

score = fuzz.ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

Results:

81
85
71
81

I want the first pair (Daniel vs. Daniel William) to score as a better match than the second pair (Daniel vs. David).
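To see why a plain edit-based score favours the second pair, it can help to look at the raw edit distances. The following is a minimal sketch using a hand-written helper, `levenshtein` (a hypothetical name, not part of fuzzywuzzy). fuzz.ratio is built on a similar character-level comparison, but its score is not computed exactly this way.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]

# The extra middle name costs 8 insertions ("WILLIAM" plus a space).
print(levenshtein('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # 8
# A different first name of similar length costs only 3 edits.
print(levenshtein('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT'))           # 3
```

A whole extra word is more expensive than a few changed letters, so any purely character-level score will tend to rank the David pair higher.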

Is there a better approach than fuzzywuzzy to use here?

asked Jul 31 '18 23:07 by Caitlin G



2 Answers

For your example, you could use token_set_ratio. According to its docstring, it splits both strings into tokens and compares the tokens they share (the intersection) against the intersection plus each string's remaining tokens.

from fuzzywuzzy import fuzz

score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

Result:

100
85
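The idea behind the token-set comparison can be sketched with the standard library alone. This is an illustration of the approach, not fuzzywuzzy's actual implementation: `token_set_sketch` and `ratio` are hypothetical names, difflib.SequenceMatcher stands in for the library's scorer, and the string preprocessing fuzzywuzzy applies (such as lowercasing) is left out, so scores on other inputs may differ from the real function.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Similarity of two strings on a 0-100 scale.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_sketch(s1, s2):
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    # Sorted shared tokens, and the sorted leftovers of each string.
    common = " ".join(sorted(tokens1 & tokens2))
    rest1 = " ".join(sorted(tokens1 - tokens2))
    rest2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + rest1).strip()
    combined2 = (common + " " + rest2).strip()
    # Keep the best of the three pairwise comparisons.
    return max(ratio(common, combined1),
               ratio(common, combined2),
               ratio(combined1, combined2))

# Every token of the shorter name appears in the longer one, so the
# shared-token string equals one of the combined strings.
print(token_set_sketch('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # 100
# Only CARTWRIGHT is shared here, so the score is lower.
print(token_set_sketch('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT'))
```

Because one name's tokens are a subset of the other's, the extra middle name no longer costs anything, which is why the first pair reaches 100.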
answered Oct 20 '22 18:10 by logee


I had a similar challenge using FuzzyWuzzy to compare one list of names against another and identify matches between the lists. The token_set_ratio scorer didn't work for me. To use your example, comparing "DANIEL CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (a partial match of 2 of 3 words) and comparing "DANIEL WILLIAM CARTWRIGHT" to itself (an identity match of 3 of 3 words) both yield a score of 100. For me, a match of 3 of 3 words needed to score higher than a match of 2 of 3.

I ended up using nltk in a Bag-of-Words-like approach. The algorithm in the code below converts multi-word names to lists of separate words (tokens), counts how many words of one list appear in the other, and normalizes that count by the number of words in each list. Because True = 1 and False = 0, a sum() over a membership test for each element works nicely to count the elements of one list that appear in another.
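As a small illustration of the counting trick, here is a sketch with made-up variable names (`tokens_a`, `tokens_b`, `flags`, `matches`) that are not part of the answer's code:

```python
tokens_a = ['DANIEL', 'CARTWRIGHT']
tokens_b = ['DANIEL', 'WILLIAM', 'CARTWRIGHT']

# One membership test per word of tokens_b.
flags = [word in tokens_a for word in tokens_b]
print(flags)    # [True, False, True]

# Booleans add up as integers, so sum() counts the True values.
matches = sum(flags)
print(matches)  # 2
```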

An identity match of all words scores 1 (100%). Scoring for your comparisons works out as follows:

  • DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT = (2/2 + 2/3)/2 = (5/3)/2 = 0.83
  • DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT = (1/2 + 1/2)/2 = 1/2 = 0.5

Note that my method ignores word order, which wasn't needed in my case.

import nltk  # word_tokenize needs NLTK's tokenizer data to be installed

s1 = 'DANIEL CARTWRIGHT'
s2 = ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT']

def myScore(lst1, lst2):
    # Score the overlap between two lists of words.
    if len(lst1) == 0 or len(lst2) == 0:
        return 0.0
    # Each membership test is True (1) or False (0), so sum() counts
    # how many words of lst2 also occur in lst1.
    c = sum(el in lst1 for el in lst2)
    # Average the match count normalized by the length of each list.
    return 0.5 * (c/len(lst1) + c/len(lst2))

# Tokenize the reference name once, then score it against each candidate.
tokens1 = nltk.word_tokenize(s1)

for s in s2:
    tokens2 = nltk.word_tokenize(s)
    score = myScore(tokens1, tokens2)
    print(' vs. '.join([s1, s]), ":", str(score))

Output:

DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT : 0.8333333333333333
DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT : 0.5
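If the names contain no punctuation, the same scoring can be sketched without nltk. The variant below is an assumption-laden alternative, not the code above: `token_overlap_score` is a hypothetical name, tokens come from a plain whitespace split (which treats punctuation differently from nltk.word_tokenize), and sets are used so that a repeated word is counted only once.

```python
def token_overlap_score(name1, name2):
    # Distinct whitespace-separated words of each name.
    words1 = set(name1.split())
    words2 = set(name2.split())
    if not words1 or not words2:
        return 0.0
    shared = len(words1 & words2)
    # Average of the shared-word fraction seen from each side.
    return 0.5 * (shared / len(words1) + shared / len(words2))

for candidate in ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT', 'DANIEL CARTWRIGHT']:
    print(candidate, ":", token_overlap_score('DANIEL CARTWRIGHT', candidate))
```

With sets the score is symmetric in its two arguments, and an exact match of all words scores 1.0 even when a name repeats a word.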
    
answered Oct 20 '22 17:10 by BalooRM