Efficient algorithm to search for matching substrings longer than 14 characters of a text inside another text

Question

I've got a long text (about 5 MB filesize) and another text called pattern (around 2000 characters).

The task is to find matching parts from a genom-pattern which are 15 characters or longer in the long text.

example:

long text: ACGTACGTGTCA AAAACCCCGGGGTTTTA GTACCCGTAGGCGTAT AND MUCH LONGER

pattern: ACGGTATTGAC AAAACCCCGGGGTTTTA TGTTCCCAG

I'm looking for an efficient (and easy to understand and implement) algorithm.

A bonus would be a way to implement this with just char-arrays in C++ if thats possible at all.

caf · Accepted Answer

Here's one algorithm - I'm not sure if it has a name. It requires a "rolling hash" - a (non-cryptographic) hash function that has the property that given the hash of a sequence AB...C, it is efficient to calculate the hash of the sequence B...CD.

Calculate the rolling hashes of the sequences pattern[0..14], pattern[1..15], pattern[2..16]... and store each index in pattern in a hash table.
Caculate the rolling hash of haystack[0..14] and see if it is in the hash table. If it is, compare haystack[0..14] to pattern[pos..pos+14] where pos was retrieved from the hash table.
From the rolling hash of haystack[0..14], efficiently compute the rolling hash of haystack[1..15] and see if it is in the hash table. Repeat until you reach the end of haystack.

Note that your 15 character strings only have 2³⁰ possible values so your "hash function" could be a simple mapping to the value of the string treated as a 15 digit base-4 number, which is fast to compute, has the rolling hash property and is unique.

Efficient algorithm to search for matching substrings longer than 14 characters of a text inside another text

Tags:

c++

c

string

algorithm

Hedge

1 Answers

caf

Recent Activity

Donate For Us

Efficient algorithm to search for matching substrings longer than 14 characters of a text inside another text

Tags:

c++

c

string

algorithm

Hedge

1 Answers

caf

Related questions

Recent Activity

Donate For Us