Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

My users will import through cut and paste a large string that will contain company names.

I have an existing and growing MYSQL database of companies names, each with a unique company_id.

I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.

Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **

For example, someone writes:

 Microsoft       -> Microsoft Bare Essentials -> Bare Escentuals Polycom, Inc.   -> Polycom 

I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:

How to find best fuzzy match for a string in a large string database

Matching inexact company names in Java

like image 628
AFG Avatar asked Dec 15 '08 21:12

AFG


People also ask

How would you do fuzzy matching using SQL?

You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches.

How do you do a fuzzy search?

Many search engines enable users to specifically request a fuzzy search in the search query by using a tilde (~) at the end of the word or term they want to search with fuzziness.

What is fuzzy autocomplete?

Fuzzy search is a technique that finds approximate matches for the search string or characters; even when it does not exactly match the entered information.

What is fuzzy matching example?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same. For example, let's take the case of hotels listing in New York as shown by Expedia and Priceline in the graphic below.


1 Answers

You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

The drawbacks of SOUNDEX() are:

  • its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
  • the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
  • for MySQL, at least according to the docs, SOUNDEX is broken for unicode input

Example:

SELECT SOUNDEX('Microsoft') SELECT SOUNDEX('Microsift') SELECT SOUNDEX('Microsift Corporation') SELECT SOUNDEX('Microsift Subsidary')  /* all of these return 'M262' */ 

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).

like image 110
Tomalak Avatar answered Oct 11 '22 21:10

Tomalak