I want to find out if you strings are almost similar. For example, string like 'Mohan Mehta' should match 'Mohan Mehte' and vice versa. Another example, string like 'Umesh Gupta' should match 'Umash Gupte'.
Basically one string is correct and other one is a mis-spelling of it. All my strings are names of people.
Any suggestions on how to achieve this.
Solution does not have to be 100 percent effective.
Two strings word1 and word2 are considered almost equivalent if the differences between the frequencies of each letter from 'a' to 'z' between word1 and word2 is at most 3 . Given two strings word1 and word2 , each of length n , return true if word1 and word2 are almost equivalent, or false otherwise.
Use the == and != operators to compare two strings for equality. Use the is operator to check if two strings are the same instance. Use the < , > , <= , and >= operators to compare strings alphabetically.
The strcmp() function, is used to compare the strings (str1,str2). The strings str1 and str2 will be compared using this function. If the function returns a value 0, it signifies that the strings are equal otherwise, strings are not equal.
You can use difflib.sequencematcher if you want something from the stdlib:
from difflib import SequenceMatcher
s_1 = 'Mohan Mehta'
s_2 = 'Mohan Mehte'
print(SequenceMatcher(a=s_1,b=s_2).ratio())
0.909090909091
fuzzywuzzy is one of numerous libs that you can install, it uses the difflib module with python-Levenshtein. You should also check out the wikipage on Approximate_string_matching
Another approach is to use a "phonetic algorithm":
A phonetic algorithm is an algorithm for indexing of words by their pronunciation.
For example using the soundex algorithm:
>>> import soundex
>>> s = soundex.getInstance()
>>> s.soundex("Umesh Gupta")
'U5213'
>>> s.soundex("Umash Gupte")
'U5213'
>>> s.soundex("Umesh Gupta") == s.soundex("Umash Gupte")
True
What you want is a string distance. There many flavors, but I would recommend starting with the Levenshtein distance.
you might want to look at NLTK (The Natural Language Toolkit), specifically the nltk.metrics package, which implements various string distance algorithms, including the Levenshtein distance mentioned already.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With