Is there any string distance algorithm that doesnt not take into account the order of the words?
The following algorithms do not give the desired results(in that example the desired result should be 1):
import jaro
jaro.jaro_winkler_metric(u'Michael Jordan',u'Jordan Michael')
>>>0.47
import Levenshtein
Levenshtein.ratio('Michael Jordan', 'Jordan Michael')
>>>0.5
from difflib import SequenceMatcher
SequenceMatcher(None, 'Michael Jordan', 'Jordan Michael').ratio()
>>>0.5
One way to making that is to have the string in alphabetical order and later use on of the above algorithms:
''.join(sorted('Michael Jordan'))
>>>' JMaacdehilnor'
''.join(sorted('Jordan Michael'))
>>>' JMaacdehilnor'
But here the information of the name and surname is lost and will not have 'stable' results.
I have created a function ,using permutations
from itertools
, that takes all the possible compilations of the words and compare the strings and output the max value. The results are satisfactory but the whole procedure is really slow when I have to compare millions of names.
Something else that can be done is to sort the words such as:
' '.join(sorted('Michael Jordan'.split()))
>>>'Jordan Michael'
' '.join(sorted('Jordan Michael'.split()))
>>>'Jordan Michael'
Seems quite nice way and easy way to decrease the computations but we loose some sensitive cases. example:
name1 = ' '.join(sorted('Bizen Dim'.split()))
>>>'Bizen Dim'
name2 = ' '.join(sorted('Dim Mpizen'.split()))
>>>'Dim Mpizen'
SequenceMatcher(None, name1, name2).ratio()
>>> 0.55
These two names are the same as there are cases where people 'translating' their names from 'b' to 'mp' (I am one of them). Using this way we are loosing this 'match'.
Is there any string distance algorithm that compares the words and do not take into consideration the order of the words? Or is there a recommendation how to implement efficiently the desired function?
try fuzzywuzzy
install:
pip install fuzzywuzzy
pip install python-Levenshtein
use with order not mattering:
fuzz.token_sort_ratio(u'Michael Jordan',u'Jordan Michael')
>>100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With