Python: find closest string (from a list) to another string

Tags: python, string, algorithm, list

Let's say I have a string "Hello" and a list

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

How can I find the n words in the list words that are closest to "Hello"?

In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

So the strategy is to sort the list words from the closest word to the furthest.

I thought about something like this:

    word = 'Hello'
    for i, item in enumerate(words):
        if item.lower() > word.lower():
            ...

but it is very slow on large lists.

UPDATE: difflib works, but it is also very slow. The word list has 630,000+ words (sorted, one per line), so checking the list takes 5 to 7 seconds for every closest-word search!

454

asked Apr 04 '12 20:04

Laura

2 Answers

Use difflib.get_close_matches.

    >>> import difflib
    >>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
    >>> difflib.get_close_matches('Hello', words)
    ['hello', 'Hallo', 'hallo']

Please check the documentation: by default the function returns at most three matches (n=3) and drops any candidate whose similarity ratio is below 0.6 (cutoff=0.6). Both parameters can be changed.
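As a minimal sketch of asking for more than the default three results: `n` caps the number of matches and `cutoff` (between 0 and 1) sets the minimum similarity ratio. Lowercasing both sides is an assumption added here so that case differences do not count against a match, and the helper name `closest_words` is hypothetical.

```python
import difflib

def closest_words(word, candidates, n=5, cutoff=0.6):
    """Return up to n candidates, ranked from most to least similar to word."""
    # Map each lowercased form to the first original spelling that produced it
    # (assumption: case-insensitive matching is what the asker wants).
    by_lower = {}
    for candidate in candidates:
        by_lower.setdefault(candidate.lower(), candidate)
    matches = difflib.get_close_matches(word.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[m] for m in matches]

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
print(closest_words('Hello', words))
```

Lowering `cutoff` lets weaker matches such as 'hi' through, at the cost of noisier results. Note that this does not make the search faster: every candidate is still compared against the query.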

173

answered Sep 28 '22 04:09

Oleh Prypin

There is an excellent article by Peter Norvig on spelling correction, with complete source code (21 lines).

http://norvig.com/spell-correct.html

The idea is to build every string that is one edit away from your word:

    hello - helo   - deletes
    hello - helol  - transposes
    hello - hallo  - replaces
    hello - heallo - inserts

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        # Every way to cut the word into a (left, right) pair
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts    = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)

Now, look up each of these edits in your list.
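One way the lookup might look, as a sketch rather than Norvig's exact code: store the word list in a `set` so each membership test takes constant time, then keep only the edits that are real words. The names `WORDS` and `known`, the lowercasing step, and the lowercase-only `alphabet` are assumptions made for illustration.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'  # assumption: lowercase ASCII words only

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical word list, lowercased once up front and stored in a set.
WORDS = {w.lower() for w in ['hello', 'Hallo', 'hi', 'house', 'key',
                             'screen', 'hallo', 'question', 'format']}

def known(candidates):
    """Keep only the candidates that appear in the word set."""
    return candidates & WORDS

word = 'Hello'.lower()
# The word itself (if present) plus every known word one edit away.
close = known({word}) | known(edits1(word))
print(sorted(close))
```

The work here depends on the length of the query word, not on the size of the list, which is what makes it attractive for a 630,000-word list. The trade-off is that it only finds words within one edit of the query.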

Peter's article is well worth reading.

answered Sep 28 '22 04:09

Amjith
