I have two tables that I need to merge together in PostgreSQL, on the common variable "company name." Unfortunately many of the company names don't match exactly (i.e. MICROSOFT in one table, MICROSFT in the other). I've tried removing common words from both columns such as "corporation" or "inc" or "ltd" in order to try to standardize names across both tables, but I'm having trouble thinking of additional strategies. Any ideas?
Thanks.
Also, if necessary I can do this in R.
Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same. For example, let's take the case of hotels listing in New York as shown by Expedia and Priceline in the graphic below.
The fuzzy matching methods look for strings that approximately match a pattern. Some fuzzy matching methods, such as Acronym and Name Variant, identify similarities using hard-coded dictionaries. Because the dictionaries aren't comprehensive, results can include unexpected or missing matches.
What is fuzzy name matching? Fuzzy matching assigns a probability to a match between 0.0 and 1.0 based on linguistic and statistical methods instead of just choosing either 1 (true) or 0 (false). As a result, names Robert and Bob can be a match with high probability even though they're not identical.
Have you considered the fuzzystrmatch module? You can use soundex
, difference
, levenshtein
, metaphone
and dmetaphone
, or a combination.
fuzzystrmatch documentation
SELECT something
FROM somewhere
WHERE levenshtein(item1, item2) < Carefully_Selected_Threshold
For example the levenshtein distance from MICROSOFT to MICROSFT is one (1).
levenshtein(dmetaphone('MICROSOFT'), dmetaphone('MICROSFT')
The above returns zero (0). Combining levenshtein and dmetaphone could help you match lots of misspellings.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With