Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm to search for matching substrings longer than 14 characters of a text inside another text

I've got a long text (about 5 MB filesize) and another text called pattern (around 2000 characters).

The task is to find matching parts from a genom-pattern which are 15 characters or longer in the long text.

example:

long text: ACGTACGTGTCA AAAACCCCGGGGTTTTA GTACCCGTAGGCGTAT AND MUCH LONGER

pattern: ACGGTATTGAC AAAACCCCGGGGTTTTA TGTTCCCAG

I'm looking for an efficient (and easy to understand and implement) algorithm.

A bonus would be a way to implement this with just char-arrays in C++ if thats possible at all.

like image 638
Hedge Avatar asked May 08 '12 00:05

Hedge


1 Answers

Here's one algorithm - I'm not sure if it has a name. It requires a "rolling hash" - a (non-cryptographic) hash function that has the property that given the hash of a sequence AB...C, it is efficient to calculate the hash of the sequence B...CD.

  1. Calculate the rolling hashes of the sequences pattern[0..14], pattern[1..15], pattern[2..16]... and store each index in pattern in a hash table.

  2. Caculate the rolling hash of haystack[0..14] and see if it is in the hash table. If it is, compare haystack[0..14] to pattern[pos..pos+14] where pos was retrieved from the hash table.

  3. From the rolling hash of haystack[0..14], efficiently compute the rolling hash of haystack[1..15] and see if it is in the hash table. Repeat until you reach the end of haystack.

Note that your 15 character strings only have 230 possible values so your "hash function" could be a simple mapping to the value of the string treated as a 15 digit base-4 number, which is fast to compute, has the rolling hash property and is unique.

like image 120
caf Avatar answered Nov 09 '22 22:11

caf