I've got a long text (about 5 MB filesize) and another text called pattern (around 2000 characters).
The task is to find matching parts from a genom-pattern which are 15 characters or longer in the long text.
example:
long text: ACGTACGTGTCA AAAACCCCGGGGTTTTA GTACCCGTAGGCGTAT AND MUCH LONGER
pattern: ACGGTATTGAC AAAACCCCGGGGTTTTA TGTTCCCAG
I'm looking for an efficient (and easy to understand and implement) algorithm.
A bonus would be a way to implement this with just char-arrays in C++ if thats possible at all.
Here's one algorithm - I'm not sure if it has a name. It requires a "rolling hash" - a (non-cryptographic) hash function that has the property that given the hash of a sequence AB...C
, it is efficient to calculate the hash of the sequence B...CD
.
Calculate the rolling hashes of the sequences pattern[0..14]
, pattern[1..15]
, pattern[2..16]
... and store each index in pattern
in a hash table.
Caculate the rolling hash of haystack[0..14]
and see if it is in the hash table. If it is, compare haystack[0..14]
to pattern[pos..pos+14]
where pos
was retrieved from the hash table.
From the rolling hash of haystack[0..14]
, efficiently compute the rolling hash of haystack[1..15]
and see if it is in the hash table. Repeat until you reach the end of haystack
.
Note that your 15 character strings only have 230 possible values so your "hash function" could be a simple mapping to the value of the string treated as a 15 digit base-4 number, which is fast to compute, has the rolling hash property and is unique.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With