Let's say I have a string
"Hello"
and a list
words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']
How can I find the n words
that are the closest to "Hello"
and present in the list words
?
In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]
So the strategy is to sort the list words from the closest word to the furthest.
I thought about something like this
word = 'Hello' for i, item in enumerate(words): if lower(item) > lower(word): ...
but it's very slow in large lists.
UPDATE difflib
works but it's very slow also. (words list
has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!
We can find the nearest value in the list by using the min() function. Define a function that calculates the difference between a value in the list and the given value and returns the absolute value of the result. Then call the min() function which returns the closest value to the given value.
Python has a built-in package called difflib with the function get_close_matches() that can help us. get_close_matches(word, possibilities, n, cutoff) accepts four parameters: word - the word to find close matches for in our list. possibilities - the list in which to search for close matches of word.
Use difflib.get_close_matches
.
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format'] >>> difflib.get_close_matches('Hello', words) ['hello', 'Hallo', 'hallo']
Please look at the documentation, because the function returns 3 or less closest matches by default.
There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.
http://norvig.com/spell-correct.html
The idea is to build all possible edits of your word,
hello - helo - deletes hello - helol - transpose hello - hallo - replaces hello - heallo - inserts def edits1(word): splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] deletes = [a + b[1:] for a, b in splits if b] transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1] replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b] inserts = [a + c + b for a, b in splits for c in alphabet] return set(deletes + transposes + replaces + inserts)
Now, look up each of these edits in your list.
Peter's article is a great read and worth reading.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With