I'm getting results from fuzzywuzzy that aren't working as well as I'd hoped: if there is an extra word in the middle of one string, the Levenshtein distance pushes the score down.
Example:
from fuzzywuzzy import fuzz
score = fuzz.ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)
score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)
Results: 81, 85, 71, 81
I'm looking for the first pair (Daniel vs. Daniel William) to be a better match than the second pair (Daniel vs. David).
Is there a better approach than fuzzywuzzy to use here?
FuzzyWuzzy is a Levenshtein-distance-based package that is widely used for computing string similarity scores. One caveat is speed: on large datasets it can be very slow; one estimate put the time to compute similarity scores for a 406,000-entry address dataset at roughly 337 hours.
Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It was developed and open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform.
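For intuition about what fuzz.ratio measures (a rough sketch, not fuzzywuzzy's exact implementation), a similar normalized similarity can be computed with difflib.SequenceMatcher from the standard library, which fuzzywuzzy itself uses unless the optional python-Levenshtein speedup is installed:
from difflib import SequenceMatcher
def simple_ratio(s1, s2):
    # similarity scaled to 0-100, comparable in spirit to fuzz.ratio
    return round(100 * SequenceMatcher(None, s1, s2).ratio())
print(simple_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # ~81
print(simple_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT'))           # ~85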
For your example, you could use token_set_ratio. The docstring says it takes the ratio of the intersection of the tokens and the remaining tokens.
from fuzzywuzzy import fuzz
score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)
Result:
100
85
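For intuition, token_set_ratio tokenizes both strings, sorts the tokens, and compares the sorted intersection of the tokens against the intersection plus each string's leftover tokens; because every token of 'DANIEL CARTWRIGHT' also appears in 'DANIEL WILLIAM CARTWRIGHT', one of those comparisons is an exact match and the score is 100. A simplified sketch of the idea (not fuzzywuzzy's exact implementation):
from fuzzywuzzy import fuzz
def rough_token_set_ratio(a, b):
    # simplified illustration of the token_set_ratio idea
    ta, tb = set(a.split()), set(b.split())
    common = ' '.join(sorted(ta & tb))
    combined_a = (common + ' ' + ' '.join(sorted(ta - tb))).strip()
    combined_b = (common + ' ' + ' '.join(sorted(tb - ta))).strip()
    return max(fuzz.ratio(common, combined_a),
               fuzz.ratio(common, combined_b),
               fuzz.ratio(combined_a, combined_b))
print(rough_token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # 100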
I had a similar challenge in using FuzzyWuzzy to compare one list of names to another to identify matches between the lists. The FuzzyWuzzy token_set_ratio scorer didn't work for me because, to use your example, comparing "DANIEL CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (a partial match of 2 of 3 words) and "DANIEL WILLIAM CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (an identity match of 3 of 3 words) both yield a score of 100. For me, a match of all 3 words needed to score higher than a match of 2 of 3.
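A quick check of that behavior with fuzzywuzzy itself:
from fuzzywuzzy import fuzz
print(fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))          # 100 (2 of 3 words)
print(fuzz.token_set_ratio('DANIEL WILLIAM CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # 100 (3 of 3 words)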
I ended up using nltk in a Bag-of-Words-like approach. The algorithm in the code below converts multi-word names to lists of distinct words (tokens) and counts matches of words in one list against the other and normalizes the counts to the numbers of words in each list. Because True = 1 and False = 0, a sum() over testing whether an element is in a list works nicely to count the elements of one list in another list.
An identity match of all words scores 1 (100%). Scoring for your comparisons works out as follows:
import nltk

s1 = 'DANIEL CARTWRIGHT'
s2 = ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT']

def myScore(lst1, lst2):
    # calculate score for comparing lists of words
    # c = number of words in lst2 that also appear in lst1
    c = sum(el in lst1 for el in lst2)
    if len(lst1) == 0 or len(lst2) == 0:
        retval = 0.0
    else:
        # average of the match count normalized to each list's length
        retval = 0.5 * (c / len(lst1) + c / len(lst2))
    return retval

# note: nltk.word_tokenize needs the 'punkt' tokenizer data (nltk.download('punkt'))
tokens1 = nltk.word_tokenize(s1)
for s in s2:
    tokens2 = nltk.word_tokenize(s)
    score = myScore(tokens1, tokens2)
    print(' vs. '.join([s1, s]), ":", str(score))
Output:
DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT : 0.8333333333333333
DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT : 0.5
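As a quick check (not shown in the original answer), an exact all-words match scores 1.0 with this metric, so a 3-of-3 identity match now ranks above the 2-of-3 partial match:
tokens_full = nltk.word_tokenize('DANIEL WILLIAM CARTWRIGHT')
print(myScore(tokens_full, tokens_full))  # 1.0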