Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: find closest string (from a list) to another string

Let's say I have a string "Hello" and a list

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format'] 

How can I find the n words that are the closest to "Hello" and present in the list words ?

In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

So the strategy is to sort the list words from the closest word to the furthest.

I thought about something like this

word = 'Hello' for i, item in enumerate(words):     if lower(item) > lower(word):       ... 

but it's very slow in large lists.

UPDATE difflib works but it's very slow also. (words list has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!

like image 454
Laura Avatar asked Apr 04 '12 20:04

Laura


People also ask

How do I find the closest value in a list Python?

We can find the nearest value in the list by using the min() function. Define a function that calculates the difference between a value in the list and the given value and returns the absolute value of the result. Then call the min() function which returns the closest value to the given value.

How do you close a match in Python?

Python has a built-in package called difflib with the function get_close_matches() that can help us. get_close_matches(word, possibilities, n, cutoff) accepts four parameters: word - the word to find close matches for in our list. possibilities - the list in which to search for close matches of word.


2 Answers

Use difflib.get_close_matches.

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format'] >>> difflib.get_close_matches('Hello', words) ['hello', 'Hallo', 'hallo'] 

Please look at the documentation, because the function returns 3 or less closest matches by default.

like image 173
Oleh Prypin Avatar answered Sep 28 '22 04:09

Oleh Prypin


There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.

http://norvig.com/spell-correct.html

The idea is to build all possible edits of your word,

hello - helo   - deletes     hello - helol  - transpose     hello - hallo  - replaces     hello - heallo - inserts       def edits1(word):    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]    deletes    = [a + b[1:] for a, b in splits if b]    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]    inserts    = [a + c + b     for a, b in splits for c in alphabet]    return set(deletes + transposes + replaces + inserts) 

Now, look up each of these edits in your list.

Peter's article is a great read and worth reading.

like image 28
Amjith Avatar answered Sep 28 '22 04:09

Amjith