I am looking for a library/class that allows smart compare of two strings. At best it would give as a result percent of how two strings are alike. I am comparing company names, addresses that are recordered in different repositories, thus having many misspellings or inconsistencies in names.
Sample strings to compare:
"Good Company Ltd." vs. "GoodCompany"
"Baker Street 2" vs. "Baker Str. 2"
If I get a result in percentage of alikeness, than this can be an input for smart merge of such data.
Do you know any good libraries that would allow such smart string compare?
String comparison means to check if the given two string are equal, if first string is greater than second string, or if first string is less than second string.
Use the in operator for partial matches, i.e., whether one string contains the other string. x in y returns True if x is contained in y ( x is a substring of y ), and False if it is not. If each character of x is contained in y discretely, False is returned.
Levenshtein is not appropriate in this case. "Good Company Ltd" and "GoodCompany" if trimmed have a distance = 3 while "Good Company Ltd" and "Food Company Ltd" have a distance of 1, but totally a different meaning. I suggest Metaphone or Double Metaphone algorithm.
Using online metaphone comparer the results are:
Good Company Ltd = KTKMPNLTT GoodCompany = KTKMPN Food Company Ltd = FTKMPNLTT GoodCompanyLLC = KTKMPNLK
In this way you know that GoodCompany
, Good Company Ltd
and GoodCompanyLLC
are similar, while Food Company
is misspelled or totally not related (KTKMPN
is contained both in KTKMPNLTT
and KTKMPNLK
but not in FTKMPNLTT
).
Look here for other algorithms comparisons.
You might want to look for Levenshtein Distance implementation. It shows how many characters insertions/deletions and substitutions are required to make two strings equal.
Here is a post about libraries in C# that implement Levenshtein Distance and other text-comparison algorithms: .NET library for text algorithms?.
However I think you'll have to use some combination of methods, because using Levenshtein will tell you that 'Good Company Ltd.' is more similar to 'Bad Company Ltd.' than to 'GoodCompany'.
Maybe you'll have to do some preprocessing by expanding 'str.' to 'street' and removing 'Ltd.' as 'meaningless' word in terms of string comparison.
UPDATE 1
Francesco De Lisi suggests to use Phonetic algoritms. Looks like they are better suited for comparing misspelled names. Still you'll need to split addresses to phonetic / non-phonetic parts (like building numbers) and compare them separately.
UPDATE 2
As for addresses comparison, this post suggests to use Google Maps API for this purpose and another post discusses address parsing. I guess that Google can produce reliable results because they have a database of street addresses where they can find the most correct street name spelling. Without list of correct street/company names you may encounter some strange name that is incorrect, however many different correct names would be similar to it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With