Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search for (Very) Approximate Substrings in a Large Database

I am trying to search for long, approximate substrings in a large database. For example, a query could be a 1000 character substring that could differ from the match by a Levenshtein distance of several hundred edits. I have heard that indexed q-grams could do this, but I don't know the implementation details. I have also heard that Lucene could do it, but is Lucene's levenshtein algorithm fast enough for hundreds of edits? Perhaps something out of the world of plagiarism detection? Any advice is appreciated.

like image 335
345871345 Avatar asked Nov 06 '22 10:11

345871345


2 Answers

Q-grams could be one approach, but there are others such as Blast, BlastP - which are used for Protein, nucleotide matches etc.

The Simmetrics library is a comprehensive collection of string distance approaches.

like image 85
Mikos Avatar answered Nov 18 '22 11:11

Mikos


Lucene does not seem to be the right tool here. In addition to Mikos' fine suggestions, I have heard about AGREP, FASTA and Locality-Sensitive Hashing(LSH). I believe that an efficient method should first prune the search space heavily, and only then do more sophisticated scoring on the remaining candidates.

like image 38
Yuval F Avatar answered Nov 18 '22 11:11

Yuval F