The problem is easy to explain: we have two big arrays (32 bit integer values) and we have to find all common sequences above a given number of consecutive position (n).
For instance, if n=3 and arrays to compare are:
a = [1, 3, 5, 7, 3, 2, 7, 4, 6, 7, 2, 1, 0, 4, 6]
b = [2, 5, 7, 3, 2, 3, 4, 5, 6, 3, 2, 7, 4, 6, 0]
The algoritmh should return, two arrays:
r0 = [5, 7, 3, 2]
r1 = [3, 2, 7, 4, 6]
(or better, its relative positions to first array and the number of consecutive bytes matched).
I believe a good point to start is the Longest Common Substring Algorithm, but perhaps anybody knows an algorithm that fits better or exactly with my problem.
I think the algorithm for finding LCS using suffix tree is a perfect fit. You build the suffix tree the same way, but in the final phase, you're not looking for the deepest node that has descendants for both strings. You're looking for all nodes with the depth of more than n
that have descendants for both strings.
I think the algorithms on the wikipedia page you reference do almost exactly what you want. You just need to modify them to keep all answers over a certain size rather than only keeping the longest answer. For instance, the dynamic programming solution on that page could be modified as follows:
function LCSubstr(S[1..m], T[1..n], min_size)
L := array(1..m, 1..n)
ret := {}
for i := 1..m
for j := 1..n
if S[i] = T[j]
if i = 1 or j = 1
L[i,j] := 1
else
L[i,j] := L[i-1,j-1] + 1
if L[i,j] >= min_size
ret := ret ∪ {S[i-z+1..z]}
return
As written this will include prefixes as well as longest matches. Those could be filtered out by tracking what string was found into L and removing a prefix from the return set when we discover that it has an extension.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With