
How to find the most similar word in a list in Python

Tags:

python

I have a list of words

list = ['car', 'animal', 'house', 'animation']

and I want to compare every list item with a string str1; the output should be the most similar word. For example, if str1 is anlmal, then animal is the most similar word. How can I do this in Python? The words in my list are usually easy to distinguish from each other.

JohnB asked Oct 09 '14 16:10




2 Answers

Use difflib:

difflib.get_close_matches(word, ['car', 'animal', 'house', 'animation'])

As you can see from perusing the source, the "close" matches are sorted from best to worst.

>>> import difflib
>>> difflib.get_close_matches('anlmal', ['car', 'animal', 'house', 'animation'])
['animal']
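
If only the single best match is wanted, the call above can be wrapped as in the following minimal sketch. The helper name most_similar is hypothetical (it is not part of difflib); n and cutoff are the documented keyword arguments of get_close_matches, and the 0.6 cutoff is simply the library default restated.

```python
import difflib

def most_similar(word, candidates, cutoff=0.6):
    """Return the closest candidate to `word`, or None if nothing is close enough.

    Hypothetical helper for illustration; it only wraps difflib.get_close_matches.
    """
    # n=1 asks for the single best match; cutoff is the minimum similarity (0..1).
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

words = ['car', 'animal', 'house', 'animation']
print(most_similar('anlmal', words))
print(most_similar('xyz', words))
```

Lowering cutoff makes the search more permissive, which may help when the inputs contain heavier typos.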
mgilson answered Nov 03 '22 11:11


I tried difflib.get_close_matches(), but it didn't work correctly for me. Here is a solution that counts the characters matching at the same positions; use it as:

closest_match, closest_match_idx = find_closet_match(test_str, list2check)

import numpy

def find_closet_match(test_str, list2check):
    # Score each candidate by the number of positions where its
    # characters equal those of test_str.
    scores = {}
    for ii in list2check:
        cnt = 0
        # Iterate over the shorter string so indexing stays in range.
        if len(test_str) <= len(ii):
            str1, str2 = test_str, ii
        else:
            str1, str2 = ii, test_str
        for jj in range(len(str1)):
            cnt += 1 if str1[jj] == str2[jj] else 0
        scores[ii] = cnt
    scores_values = numpy.array(list(scores.values()))
    # argsort is ascending, so the last index has the highest score.
    closest_match_idx = numpy.argsort(scores_values, axis=0, kind='quicksort')[-1]
    closest_match = numpy.array(list(scores.keys()))[closest_match_idx]
    return closest_match, closest_match_idx
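
Because the function above compares characters position by position, a single inserted or deleted character shifts everything after it and lowers the score. As a rough alternative sketch, not the answerer's method, the same (match, index) pair can be obtained from the standard library's difflib.SequenceMatcher, which tolerates such shifts. The function name closest_match_with_index is hypothetical.

```python
from difflib import SequenceMatcher

def closest_match_with_index(test_str, candidates):
    """Return (best_candidate, its_index) ranked by SequenceMatcher.ratio().

    Hypothetical helper for illustration. Raises ValueError if candidates is empty.
    """
    # ratio() is a similarity score in [0, 1]; higher means more similar.
    ratios = [SequenceMatcher(None, test_str, c).ratio() for c in candidates]
    # Pick the index of the highest ratio (first one wins on ties).
    best_idx = max(range(len(candidates)), key=ratios.__getitem__)
    return candidates[best_idx], best_idx

words = ['car', 'animal', 'house', 'animation']
print(closest_match_with_index('anlmal', words))
print(closest_match_with_index('aanimal', words))
```

This version needs no third-party packages, and the returned index refers to the original list rather than to a dictionary of scores.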
amit answered Nov 03 '22 11:11