Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding if two strings are almost similar

I want to find out if you strings are almost similar. For example, string like 'Mohan Mehta' should match 'Mohan Mehte' and vice versa. Another example, string like 'Umesh Gupta' should match 'Umash Gupte'.

Basically one string is correct and other one is a mis-spelling of it. All my strings are names of people.

Any suggestions on how to achieve this.

Solution does not have to be 100 percent effective.

like image 941
Salil Agarwal Avatar asked Jul 26 '15 23:07

Salil Agarwal


People also ask

How do you know if two strings are almost equal?

Two strings word1 and word2 are considered almost equivalent if the differences between the frequencies of each letter from 'a' to 'z' between word1 and word2 is at most 3 . Given two strings word1 and word2 , each of length n , return true if word1 and word2 are almost equivalent, or false otherwise.

How do you check if two strings are almost equal in python?

Use the == and != operators to compare two strings for equality. Use the is operator to check if two strings are the same instance. Use the < , > , <= , and >= operators to compare strings alphabetically.

Which function is useful to check whether the two strings are exactly same or not?

The strcmp() function, is used to compare the strings (str1,str2). The strings str1 and str2 will be compared using this function. If the function returns a value 0, it signifies that the strings are equal otherwise, strings are not equal.


4 Answers

You can use difflib.sequencematcher if you want something from the stdlib:

from difflib import SequenceMatcher
s_1 = 'Mohan Mehta'
s_2 = 'Mohan Mehte'
print(SequenceMatcher(a=s_1,b=s_2).ratio())
0.909090909091

fuzzywuzzy is one of numerous libs that you can install, it uses the difflib module with python-Levenshtein. You should also check out the wikipage on Approximate_string_matching

like image 77
Padraic Cunningham Avatar answered Oct 02 '22 08:10

Padraic Cunningham


Another approach is to use a "phonetic algorithm":

A phonetic algorithm is an algorithm for indexing of words by their pronunciation.

For example using the soundex algorithm:

>>> import soundex
>>> s = soundex.getInstance()
>>> s.soundex("Umesh Gupta")
'U5213'
>>> s.soundex("Umash Gupte")
'U5213'
>>> s.soundex("Umesh Gupta") == s.soundex("Umash Gupte")
True
like image 31
Steven Kryskalla Avatar answered Oct 02 '22 08:10

Steven Kryskalla


What you want is a string distance. There many flavors, but I would recommend starting with the Levenshtein distance.

like image 36
fgregg Avatar answered Oct 02 '22 10:10

fgregg


you might want to look at NLTK (The Natural Language Toolkit), specifically the nltk.metrics package, which implements various string distance algorithms, including the Levenshtein distance mentioned already.

like image 21
Steven Kay Avatar answered Oct 02 '22 08:10

Steven Kay