Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching fuzzy strings

I have two tables that I need to merge together in PostgreSQL, on the common variable "company name." Unfortunately many of the company names don't match exactly (i.e. MICROSOFT in one table, MICROSFT in the other). I've tried removing common words from both columns such as "corporation" or "inc" or "ltd" in order to try to standardize names across both tables, but I'm having trouble thinking of additional strategies. Any ideas?

Thanks.

Also, if necessary I can do this in R.

like image 225
aesir Avatar asked Jan 19 '12 16:01

aesir


People also ask

What is fuzzy matching example?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same. For example, let's take the case of hotels listing in New York as shown by Expedia and Priceline in the graphic below.

What is fuzzy in matching rule?

The fuzzy matching methods look for strings that approximately match a pattern. Some fuzzy matching methods, such as Acronym and Name Variant, identify similarities using hard-coded dictionaries. Because the dictionaries aren't comprehensive, results can include unexpected or missing matches.

What is fuzzy name matching?

What is fuzzy name matching? Fuzzy matching assigns a probability to a match between 0.0 and 1.0 based on linguistic and statistical methods instead of just choosing either 1 (true) or 0 (false). As a result, names Robert and Bob can be a match with high probability even though they're not identical.


1 Answers

Have you considered the fuzzystrmatch module? You can use soundex, difference, levenshtein, metaphone and dmetaphone, or a combination.

fuzzystrmatch documentation

SELECT something
FROM somewhere
WHERE levenshtein(item1, item2) < Carefully_Selected_Threshold

For example the levenshtein distance from MICROSOFT to MICROSFT is one (1).

levenshtein(dmetaphone('MICROSOFT'), dmetaphone('MICROSFT')

The above returns zero (0). Combining levenshtein and dmetaphone could help you match lots of misspellings.

like image 65
Anders Marzi Tornblad Avatar answered Nov 03 '22 01:11

Anders Marzi Tornblad