I have a list of city names, some of which are misspelled:
['bercelona', 'emstrdam', 'Praga']
And a list of all possible city names, spelled correctly:
['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
I'm looking for an algorithm that finds the closest match in the second list for each name in the first list, and returns the first list with the names spelled correctly. So it should return the following list:
['Barcelona', 'Amsterdam', 'Prague']
Spell checking is a basic requirement in text processing and analysis. The Python package pyspellchecker can find words that may be misspelled and suggest possible corrections.
You can use difflib.SequenceMatcher from the standard library, which is based on the Ratcliff/Obershelp algorithm:
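import difflib

def is_similar(first, second, ratio):
    # True when the similarity score (between 0 and 1) exceeds the threshold
    return difflib.SequenceMatcher(None, first, second).ratio() > ratio

first = ['bercelona', 'emstrdam', 'Praga']
second = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
# Keep every correctly spelled name that is similar enough to a misspelled one
result = [s for f in first for s in second if is_similar(f, s, 0.7)]
print(result)
['Barcelona', 'Amsterdam', 'Prague']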
Here 0.7 is the similarity threshold. Run some tests on your own data and adjust this value as needed. The ratio shows how similar the two strings are (1 means the strings are identical, 0 means they have nothing in common).
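Note that a threshold-based list comprehension can return zero or several names for a single input, so the result may not line up one-to-one with the first list. If you want exactly one output per input, a possible variation is difflib.get_close_matches with n=1, which returns only the best match above the cutoff. This is a minimal sketch: the helper name correct_names and the choice to keep the original string when nothing matches are my own assumptions, not part of the question.

```python
import difflib

def correct_names(names, valid_names, cutoff=0.7):
    """Return one name per input: the best match, or the input itself if none qualifies.

    correct_names is a hypothetical helper; falling back to the original
    string is an assumption about the desired behavior.
    """
    corrected = []
    for name in names:
        # n=1 asks for the single best candidate scoring at least `cutoff`
        matches = difflib.get_close_matches(name, valid_names, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else name)
    return corrected

first = ['bercelona', 'emstrdam', 'Praga']
second = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
print(correct_names(first, second))
# ['Barcelona', 'Amsterdam', 'Prague']
```

The comparison is case-sensitive, so lowercasing both lists before matching may improve the scores for inputs like 'bercelona'.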