
How to find best fuzzy match for a string in a large string database


I have a database of arbitrary-length strings that holds more than one million items and may grow further.

I need to compare a user-provided string against the whole database and retrieve an identical string if one exists, or otherwise return the closest fuzzy match(es) with 60% similarity or better. The search should ideally take under one second.

My idea is to first narrow down the candidates from the db based on their length, and then use edit distance to compare each remaining db string to the search string.
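To make the idea concrete, here is a minimal Python sketch of a length pre-filter followed by an edit-distance comparison. It assumes "similarity" means 1 minus the Levenshtein distance divided by the length of the longer string; the question does not define the measure, so that definition and the names `levenshtein`, `similarity`, and `fuzzy_search` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def fuzzy_search(query, strings, threshold=0.6):
    """Return (score, string) pairs at or above threshold, best first."""
    matches = []
    for s in strings:
        longest = max(len(query), len(s))
        # The edit distance is at least the length difference, so a
        # candidate whose length differs too much can never reach the
        # threshold and is skipped without computing the distance.
        if longest and abs(len(query) - len(s)) / longest > 1.0 - threshold:
            continue
        score = similarity(query, s)
        if score >= threshold:
            matches.append((score, s))
    matches.sort(key=lambda m: (-m[0], m[1]))
    return matches
```

The length filter only prunes; every surviving candidate still costs a full distance computation, so on a million strings this linear scan is unlikely to meet a one-second budget on its own.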

However, as I will need to perform this operation very often, I'm thinking about building an index of the db strings to keep in memory and query the index, not the db directly.
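One simple form such an in-memory index could take is a set for exact lookups plus buckets of strings grouped by length. The sketch below is only an illustration: the class name `LengthIndex` and its methods are hypothetical, and the length bound assumes similarity is defined as 1 minus edit distance over the longer string's length.

```python
from collections import defaultdict


class LengthIndex:
    """Hypothetical in-memory index: exact-match set plus length buckets."""

    def __init__(self, strings):
        self.exact = set()
        self.by_length = defaultdict(list)
        for s in strings:
            if s not in self.exact:  # ignore duplicates
                self.exact.add(s)
                self.by_length[len(s)].append(s)

    def has_exact(self, query: str) -> bool:
        """Constant-time check for an identical string."""
        return query in self.exact

    def candidates(self, query: str, threshold: float = 0.6):
        """Yield strings whose length alone does not rule them out.

        Under the assumed similarity measure, a candidate can only reach
        the threshold if the shorter length is at least threshold times
        the longer length.
        """
        n = len(query)
        for m, bucket in self.by_length.items():
            if min(n, m) >= threshold * max(n, m):
                yield from bucket
```

A query would first call `has_exact`, and only on a miss run the edit-distance comparison over `candidates(query)`.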

Any ideas on how to approach this problem differently or how to build the in-memory index?

guillermooo asked Nov 21 '08 17:11




2 Answers

This paper seems to describe exactly what you want.

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.

zaratustra answered Oct 03 '22 05:10


You didn't mention your database system, but for PostgreSQL you could use the following contrib module: trgm - Trigram matching for PostgreSQL

The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.
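To illustrate the general idea of trigram matching outside the database, here is a rough Python sketch. It lowercases the text, splits it into words, pads each word with two leading spaces and one trailing space, and scores two strings by the share of trigrams they have in common. It is an approximation for illustration, not the module's actual implementation, and the function names are made up for this example.

```python
import re


def trigrams(text: str) -> set[str]:
    """Set of 3-character substrings of each padded, lowercased word."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0  # nothing to compare
    return len(ta & tb) / len(ta | tb)
```

Because the score depends only on sets of trigrams, an index from trigram to the strings containing it can shortlist candidates without scanning every row, which is what makes the approach attractive for large tables.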

Patryk Kordylewski answered Oct 03 '22 05:10