Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Super fuzzy name checking?

I'm working on some stuff for an in-house CRM. The company's current frontend allows for lots of duplicates. I'm trying to stop end-users from putting in the same person because they searched for 'Bill Johnson' and not 'William Johnson.' So the user will put in some information about their new customer and we'll find the similar names (including fuzzy names) and match them against what is already in our database and ask if they meant those things... Does such a database or technology exist?

like image 962
Jon Phenow Avatar asked Jul 20 '10 13:07

Jon Phenow


People also ask

What is a fuzzy name search?

A fuzzy search is a technique that uses search algorithms to find strings that match patterns approximately. It's particularly useful for helping users find webpages without having to know exactly what they're looking for or how a word is spelled.

What is fuzzy name matching?

What is fuzzy name matching? Fuzzy matching assigns a probability to a match between 0.0 and 1.0 based on linguistic and statistical methods instead of just choosing either 1 (true) or 0 (false). As a result, names Robert and Bob can be a match with high probability even though they're not identical.

How do I test fuzzy search?

A fuzzy search query searches for character sequences that are not only the same but similar to the query term. Use the tilde symbol (~) at the end of a term to do a fuzzy search. For example, the following query finds documents that include the terms analytics, analyze, analysis, and so on.

What is fuzzy matching example?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same. For example, let's take the case of hotels listing in New York as shown by Expedia and Priceline in the graphic below.


2 Answers

I implemented such a functionality on one website. I use double_metaphone() + levenstein() in PHP. I precalculate a double_metaphone() for each entry in the dabatase, which I lookup using a SELECT of the first x chars of the 'metaphoned' searched term.

Then I sort the returned result according to their levenstein distance. double_metaphone() is not part of any PHP library (last time I checked), so I borrowed a PHP implementation I found somewhere a long while ago on the net (site no longer on line). I should post it somewhere I suppose.

EDIT: The website is still in archive.org: http://web.archive.org/web/20080728063208/http://swoodbridge.com/DoubleMetaPhone/

or Google cache: http://webcache.googleusercontent.com/search?q=cache:Tr9taWl9hMIJ:swoodbridge.com/DoubleMetaPhone/+Stephen+Woodbridge+double_metaphon

which leads to many other useful links with source code for double_metaphone(), including one in Javascript on github: http://github.com/maritz/js-double-metaphone

EDIT: Went through my old code, and here are roughly the steps of what I do, pseudo coded to keep it clear:

1) Precompute a double_metaphone() for every word in the database, i.e., $word='blahblah'; $soundslike=double_metaphone($word);

2) At lookup time, $word is fuzzy-searched against the database: $soundslike = double_metaphone($word)

4) SELECT * FROM table WHERE soundlike LIKE $soundlike (if you have levenstein stored as a procedure, much better: SELECT * FROM table WHERE levenstein(soundlike,$soundlike) < mythreshold ORDER BY levenstein(word,$word) ASC LIMIT ... etc.

It has worked well for me, although I can't use a stored procedure, since I have no control over the server and it's using MySQL 4.20 or something.

like image 109
R. Hill Avatar answered Sep 17 '22 22:09

R. Hill


I asked a similar question once. Name Hypocorism List I never did get around to doing anything with it but the problem has come up again at work so I might write and open source a library in .net for doing some matching.

Update: I ported the perl module I mentioned there to C# and put it up on github. http://github.com/stimms/Nicknames

like image 24
stimms Avatar answered Sep 18 '22 22:09

stimms