Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithms for computing the distance between two strings

Is there any string distance algorithm that doesnt not take into account the order of the words?

The following algorithms do not give the desired results(in that example the desired result should be 1):

import jaro
jaro.jaro_winkler_metric(u'Michael Jordan',u'Jordan Michael')
>>>0.47

import Levenshtein
Levenshtein.ratio('Michael Jordan', 'Jordan Michael')
>>>0.5

from difflib import SequenceMatcher
SequenceMatcher(None, 'Michael Jordan', 'Jordan Michael').ratio()
>>>0.5

One way to making that is to have the string in alphabetical order and later use on of the above algorithms:

''.join(sorted('Michael Jordan'))
>>>' JMaacdehilnor'

''.join(sorted('Jordan Michael'))
>>>' JMaacdehilnor'

But here the information of the name and surname is lost and will not have 'stable' results.

I have created a function ,using permutations from itertools, that takes all the possible compilations of the words and compare the strings and output the max value. The results are satisfactory but the whole procedure is really slow when I have to compare millions of names.

Something else that can be done is to sort the words such as:

' '.join(sorted('Michael Jordan'.split()))
>>>'Jordan Michael'
' '.join(sorted('Jordan Michael'.split()))
>>>'Jordan Michael'

Seems quite nice way and easy way to decrease the computations but we loose some sensitive cases. example:

name1 = ' '.join(sorted('Bizen Dim'.split()))
>>>'Bizen Dim'
name2 = ' '.join(sorted('Dim Mpizen'.split()))
>>>'Dim Mpizen'

SequenceMatcher(None, name1, name2).ratio()
>>>  0.55

These two names are the same as there are cases where people 'translating' their names from 'b' to 'mp' (I am one of them). Using this way we are loosing this 'match'.

Is there any string distance algorithm that compares the words and do not take into consideration the order of the words? Or is there a recommendation how to implement efficiently the desired function?

like image 665
Mpizos Dimitris Avatar asked Oct 18 '22 15:10

Mpizos Dimitris


1 Answers

try fuzzywuzzy

install:

pip install fuzzywuzzy
pip install python-Levenshtein

use with order not mattering:

fuzz.token_sort_ratio(u'Michael Jordan',u'Jordan Michael')
>>100
like image 81
Roman Avatar answered Nov 11 '22 08:11

Roman