Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group words whose Levenshtein distance is more than 80 percent in Python

Suppose I have a list:-

person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']

I am trying to group the list in such a way so the Levenshtein distance between two strings is maximum. For finding out the ratio between two words, I am using a python package fuzzywuzzy.

Examples :-

>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
>>> 

My end goal:

My end goal is to group the words such that Levenshtein distance between them is more than 80 percent?

My list should look something like this :-

person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC'] because the distance between `bikash` and `zikash` is very high so they should be together.

Code:

I am trying to achieve this by sorting but key function should be fuzz.ratio. Well below code is not working, but I am approaching the problem in this angle.

from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print combined_list

Could anyone help me to combine the words so that Levenshtein distance between them is more than 80 percent?

like image 650
python Avatar asked Feb 03 '16 08:02

python


People also ask

How does Python calculate Levenshtein distance?

Levenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2.

What is the algorithm to find the edit distance between two words?

For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2. 2. Similarly, for the second case of inserting last character of str2 into str1, we have to do - (1 + array[m][n-1]).

How do you read Levenshtein distance?

The Levenshtein distance is a number that tells you how different two strings are. The higher the number, the more different the two strings are. For example, the Levenshtein distance between “kitten” and “sitting” is 3 since, at a minimum, 3 edits are required to change one into the other.

What are three major operations of Levenshtein edit distance?

Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance .


1 Answers

This groups the names

from fuzzywuzzy import fuzz

combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.append('bakesh')
print('input names:', combined_list)

grs = list() # groups of names with distance > 80
for name in combined_list:
    for g in grs:
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        grs.append([name, ])

print('output groups:', grs)
outlist = [el for g in grs for el in g]
print('output list:', outlist)

producing

input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']

As you can see, the names are grouped correctly, but the order may not be the one you desire.

like image 106
Pynchia Avatar answered Oct 13 '22 17:10

Pynchia