How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

Tags: string, matching, mysql, fuzzy-search

My users will import, through cut and paste, a large string that contains company names.

I have an existing and growing MySQL database of company names, each with a unique company_id.

I want to be able to parse the string and assign a fuzzy match to each of the user-entered company names.

Right now I am just doing a straight-up string match, which is also slow. Will Soundex indexing be faster? How can I give the user some options as they are typing?

For example, someone writes:

 Microsoft       -> Microsoft
 Bare Essentials -> Bare Escentuals
 Polycom, Inc.   -> Polycom

I have found the following threads that seem similar to this question, but the posters have not accepted an answer, and I'm not sure whether their use cases apply to mine:

How to find best fuzzy match for a string in a large string database

Matching inexact company names in Java


asked Dec 15 '08 21:12

AFG

1 Answer

You can start by using SOUNDEX(). This will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

The drawbacks of SOUNDEX() are:

its inability to differentiate longer strings. Only the first few characters are taken into account, so longer strings that diverge at the end generate the same SOUNDEX value.
the fact that the first letter must be the same, or you won't find a match easily. SQL Server has a DIFFERENCE() function to tell you how far apart two SOUNDEX values are, but I think MySQL has nothing of that kind built in.
for MySQL, at least according to the docs, SOUNDEX() is unreliable for Unicode (multibyte) input.

Example:

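 SELECT SOUNDEX('Microsoft');
 SELECT SOUNDEX('Microsift');
 SELECT SOUNDEX('Microsift Corporation');
 SELECT SOUNDEX('Microsift Subsidary');
 /* as standard four-character Soundex codes, all of these return 'M262';
    MySQL's SOUNDEX() may return a longer string, so compare LEFT(SOUNDEX(...), 4) there */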
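To see why these inputs collide, here is a minimal sketch of the standard four-character (American) Soundex in Python. It is an illustration only, not the database's own implementation, and the function name `soundex` is just a label chosen for this example. PHP users have a built-in `soundex()` that follows the same scheme.

```python
def soundex(name: str) -> str:
    """Standard four-character Soundex code; ASCII letters only."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    # Ignore spaces, punctuation and anything outside A-Z.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]                  # the first letter is kept verbatim
    prev = codes.get(letters[0], "")     # its code still blocks an equal neighbour
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        digit = codes.get(ch, "")        # vowels (and Y) map to ""
        if digit and digit != prev:
            result += digit
        prev = digit                     # a vowel resets prev, so repeats count again
        if len(result) == 4:
            break                        # everything after this point is ignored
    return result.ljust(4, "0")


for text in ("Microsoft", "Microsift", "Microsift Corporation", "Microsift Subsidary"):
    print(text, "->", soundex(text))     # each line ends in M262
```

The early `break` is the first drawback in action: once three digits are collected, the rest of the string no longer matters.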

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
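As a rough illustration of "distance plus threshold", the following Python sketch computes the Levenshtein distance with the usual dynamic-programming table and keeps only candidates within a cutoff. The helper name `closest`, the default threshold of 2 and the sample company list are assumptions made for this example. PHP has a built-in `levenshtein()` that can play the same role.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a                      # make b the shorter string so rows stay small
    previous = list(range(len(b) + 1))   # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def closest(query, candidates, threshold=2):
    """Return (distance, name) pairs within the threshold, best match first."""
    q = query.lower()
    scored = [(levenshtein(q, name.lower()), name) for name in candidates]
    return sorted(pair for pair in scored if pair[0] <= threshold)


companies = ["Microsoft", "Bare Escentuals", "Polycom"]   # sample data
print(levenshtein("Microsoft", "Nzcrosoft"))              # 2
print(closest("Bare Essentials", companies))              # [(2, 'Bare Escentuals')]
```

Note that `closest` has to call `levenshtein` once per candidate, which is where the extra cost comes from as the table grows.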

The main drawback is that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX value in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result, because the distance has to be computed against every candidate at query time.
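The pre-calculated approach can be sketched as follows. This example uses Python's built-in `sqlite3` purely as a stand-in for MySQL so that it runs anywhere. The table layout, the `name_soundex` column and the `suggest` helper are hypothetical; only `company_id` comes from the question. The Soundex code is computed once per row at insert time, so a lookup becomes a plain indexed equality match.

```python
import sqlite3

CODES = {ch: d for group, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                ("L", "4"), ("MN", "5"), ("R", "6")) for ch in group}


def soundex(name: str) -> str:
    """Compact standard four-character Soundex (ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result, prev = letters[0], CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                      # H and W do not separate equal codes
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit
    return result[:4].ljust(4, "0")


conn = sqlite3.connect(":memory:")        # stand-in for the MySQL database
conn.execute("CREATE TABLE companies ("
             "company_id INTEGER PRIMARY KEY, name TEXT, name_soundex TEXT)")
conn.execute("CREATE INDEX idx_companies_soundex ON companies (name_soundex)")

for name in ("Microsoft", "Bare Escentuals", "Polycom"):
    # The code is calculated once here, not on every search.
    conn.execute("INSERT INTO companies (name, name_soundex) VALUES (?, ?)",
                 (name, soundex(name)))


def suggest(user_input: str):
    """Return (company_id, name) rows whose stored code equals the input's code."""
    cursor = conn.execute(
        "SELECT company_id, name FROM companies WHERE name_soundex = ? ORDER BY name",
        (soundex(user_input),))
    return cursor.fetchall()


print(suggest("Microsift"))        # [(1, 'Microsoft')]
print(suggest("Bare Essentials"))  # [(2, 'Bare Escentuals')]
print(suggest("Polycom, Inc."))    # [(3, 'Polycom')]
```

A possible refinement, under the same assumptions, is to use the Soundex match only to narrow the candidates and then rank that short list by edit distance.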

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).


answered Oct 11 '22 21:10

Tomalak
