Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Smart string comparison

I am looking for a library/class that allows smart compare of two strings. At best it would give as a result percent of how two strings are alike. I am comparing company names, addresses that are recordered in different repositories, thus having many misspellings or inconsistencies in names.

Sample strings to compare:

 "Good Company Ltd." vs. "GoodCompany" 
 "Baker Street 2" vs. "Baker Str. 2" 

If I get a result in percentage of alikeness, than this can be an input for smart merge of such data.

Do you know any good libraries that would allow such smart string compare?

like image 909
Radoslaw Avatar asked May 23 '13 11:05

Radoslaw


People also ask

What is a string comparison?

String comparison means to check if the given two string are equal, if first string is greater than second string, or if first string is less than second string.

How do you compare two strings partially in Python?

Use the in operator for partial matches, i.e., whether one string contains the other string. x in y returns True if x is contained in y ( x is a substring of y ), and False if it is not. If each character of x is contained in y discretely, False is returned.


2 Answers

Levenshtein is not appropriate in this case. "Good Company Ltd" and "GoodCompany" if trimmed have a distance = 3 while "Good Company Ltd" and "Food Company Ltd" have a distance of 1, but totally a different meaning. I suggest Metaphone or Double Metaphone algorithm.

Using online metaphone comparer the results are:

Good Company Ltd = KTKMPNLTT
GoodCompany = KTKMPN
Food Company Ltd = FTKMPNLTT
GoodCompanyLLC = KTKMPNLK

In this way you know that GoodCompany, Good Company Ltd and GoodCompanyLLC are similar, while Food Company is misspelled or totally not related (KTKMPN is contained both in KTKMPNLTT and KTKMPNLK but not in FTKMPNLTT).

Look here for other algorithms comparisons.

like image 109
Francesco De Lisi Avatar answered Oct 08 '22 02:10

Francesco De Lisi


You might want to look for Levenshtein Distance implementation. It shows how many characters insertions/deletions and substitutions are required to make two strings equal.

Here is a post about libraries in C# that implement Levenshtein Distance and other text-comparison algorithms: .NET library for text algorithms?.

However I think you'll have to use some combination of methods, because using Levenshtein will tell you that 'Good Company Ltd.' is more similar to 'Bad Company Ltd.' than to 'GoodCompany'.

Maybe you'll have to do some preprocessing by expanding 'str.' to 'street' and removing 'Ltd.' as 'meaningless' word in terms of string comparison.

UPDATE 1

Francesco De Lisi suggests to use Phonetic algoritms. Looks like they are better suited for comparing misspelled names. Still you'll need to split addresses to phonetic / non-phonetic parts (like building numbers) and compare them separately.

UPDATE 2

As for addresses comparison, this post suggests to use Google Maps API for this purpose and another post discusses address parsing. I guess that Google can produce reliable results because they have a database of street addresses where they can find the most correct street name spelling. Without list of correct street/company names you may encounter some strange name that is incorrect, however many different correct names would be similar to it.

like image 24
Artemix Avatar answered Oct 08 '22 01:10

Artemix