How to group words whose Levenshtein distance is more than 80 percent in Python

Tags: python, group-by, levenshtein-distance, fuzzy-search, fuzzy-logic

Suppose I have a list:

person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']

I am trying to reorder the list so that strings with a high Levenshtein similarity ratio end up next to each other. To compute the ratio between two words, I am using the Python package fuzzywuzzy.

Examples:

>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
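
For readers without fuzzywuzzy installed, the scores above can be reproduced with the standard library. This is only a sketch: `similarity` is a hypothetical helper, not part of fuzzywuzzy, and it assumes that `difflib.SequenceMatcher.ratio()` (which returns 2*M/T, where M is the number of matched characters and T the combined length of both strings) is an acceptable stand-in for `fuzz.ratio`. It matches the three scores shown here; other inputs may score slightly differently depending on the backend fuzzywuzzy uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from difflib."""
    # ratio() is 2*M/T: M matched characters, T total length of a and b.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(similarity('goldman LLC', 'oldman LLC'))  # 10 of 21 characters match twice over: 95
print(similarity('rakesh', 'zakesh'))           # 5 matched characters out of 12 total: 83
print(similarity('bikash', 'zikash'))           # 83
```

Note that this is a similarity score (higher means more alike), not a distance, so "more than 80" means "at least this similar".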

My end goal:

I want to group the words so that words whose similarity ratio is above 80 end up together.

My list should look something like this:

person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC']

because the ratio between `bikash` and `zikash` is very high, so they should be together.

Code:

I am trying to achieve this by sorting with fuzz.ratio as the key function. The code below does not work, but this is the angle from which I am approaching the problem.

from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
# Does not work: sort() calls key with a single element, so this raises TypeError
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print(combined_list)
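
The attempt above fails because `list.sort` passes one element at a time to `key`, while a similarity score belongs to a pair of elements. A rough sketch of the pairwise view, using `itertools.combinations` to score every unordered pair and keep those above the threshold, might look like this. The helpers `similarity` and `similar_pairs` are hypothetical names, and `difflib` is assumed here as a stand-in for `fuzz.ratio` so the snippet runs without extra packages.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def similar_pairs(names, threshold=80):
    """Return (a, b, score) for every unordered pair scoring above threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
for a, b, score in similar_pairs(combined_list):
    print(a, b, score)
```

This compares every pair, so the work grows quadratically with the length of the list, which is fine for short lists like the one here.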

Could anyone help me group the words so that the similarity ratio within each group is above 80?

asked Feb 03 '16 08:02

python

1 Answer

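This groups the names. Each name joins the first existing group in which it scores above 80 against every member; if no such group exists, it starts a new one.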

from fuzzywuzzy import fuzz

combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.append('bakesh')
print('input names:', combined_list)

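grs = []  # groups of names whose pairwise ratio is > 80
for name in combined_list:
    for g in grs:
        # join the first group where the name is close to every member
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        # no break happened: no group fits, so start a new one
        grs.append([name])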

print('output groups:', grs)
outlist = [el for g in grs for el in g]
print('output list:', outlist)

This produces:

input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']

As you can see, the names are grouped correctly, but the order may not be the one you desire.
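
One caveat: the loop above is greedy, so a name goes into the first group that accepts it and the result can depend on the order of the input. If that matters, one possible alternative (not what the answer's code does) is to treat every pair scoring above the threshold as an edge and return the connected components. The sketch below does this with a small union-find. `group_connected` and `similarity` are hypothetical helpers, and `difflib` is assumed as a stand-in for `fuzz.ratio`. Be aware that this variant can chain names together: two names may share a group through a common neighbour without being similar to each other.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def group_connected(names, threshold=80):
    """Group names into connected components of the 'score > threshold' graph."""
    parent = list(range(len(names)))  # union-find forest over list indices

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) > threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as root so groups follow input order.
                parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


names = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
print(group_connected(names))
```

For this particular input the components coincide with the groups shown above; on other data the two approaches can disagree.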

answered Oct 13 '22 17:10

Pynchia
