Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do fuzzy string search without a heavy database?

I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that would find misspelled, mixed names like "warm cmfrter".

We have code using edit-distance (difflib), but it probably won't scale to the 18000 names.

I achieved something similar with Lucene, but as PyLucene only wraps Java that would complicate deployment to end-users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings are like C++ and have some learning curve.

Whoosh is not yet well-documented but includes an abusable spell-checker.

What else is there?

like image 203
Tobias Avatar asked May 07 '09 13:05

Tobias


People also ask

Is Fuzzy Wuzzy slow?

FuzzyWuzzy package is a Levenshtein distance based method which widely used in computing similarity scores of strings. But why we should not use it? The answer is simple: it is way too slow. The estimated time of computing similarity scores for a 406,000-entity dataset of addresses is 337 hours.

How do you evaluate a fuzzy match?

One of the most effective ways to calculate scores for a fuzzy string matching algorithm is by using cosine similarity. The cosine similarity between two non-zero vectors is simply the cosine of the angle between these vectors.

How accurate is fuzzy matching?

Higher matching accuracy: fuzzy matching proves to be a far more accurate method of finding matching across two or more datasets. Unlike deterministic matching that determines matches on a 0 or 1 basis, fuzzy matching can detect variations that lie between 0 and 1 basis on a given matching threshold.

How do I create a fuzzy search?

A fuzzy search query searches for character sequences that are not only the same but similar to the query term. Use the tilde symbol (~) at the end of a term to do a fuzzy search. For example, the following query finds documents that include the terms analytics, analyze, analysis, and so on.


2 Answers

Apparently the only way to make fuzzy comparisons fast is to do less of them ;)

Instead of writing another n-gram search or improving the one in Whoosh we now keep a word index, retrieve all entries that have at least one (correctly spelled) word in common with the query, and use difflib to rank those. Works well enough in this case.

like image 170
Tobias Avatar answered Sep 23 '22 02:09

Tobias


You will get too many false positives with the SOUNDEX implementation. There are only 26,000 (at most) possible SOUNDEX codes.

Though the Metaphone algorithim was designed for English surnames, it works pretty well for misspellings; I used it once in a branch locator and it was quite successful.

Add a field with the Metaphone translation and match against it if no exact match is found. You will still get false positives, but fewer that with other algorithims.

like image 35
R Ubben Avatar answered Sep 22 '22 02:09

R Ubben