In the past few days I've researched this extensively, I've read so many things that I am now more confused then ever. How does one find the longest common sub string in a large data set? The idea is to remove duplicate content from this data set (of varying lengths, so the algo will need to run continuously). By large data set I mean approximately 100mb of text.
Suffix tree? Suffix array? Rabin-Karp? What's the best way? And is there a library out there that can help me?
Really hoping for a good response, my head hurts a lot. Thank you! :-)
The longest common substrings of a set of strings can be found by building a generalized suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it.
The time complexity of finding the longest non-repeated substring is O(n).
Python Program To Find Length Of The Longest Substring Without Repeating Characters. Given a string str, find the length of the longest substring without repeating characters. For “ABDEFGABEF”, the longest substring are “BDEFGA” and “DEFGAB”, with length 6. For “BBBB” the longest substring is “B”, with length 1.
I've always been using suffix arrays. Because I've been told always this is the fastest way there.
If you are running out of memory on the machine the algorithm is running, you can always save your array in a file on your hard-drive. It will slow down considerably the algorithm but it will provide the result, alt least.
And I don't think that a library will do a better job than a good written and clean algorithm.
LE: Btw, you don't need to remove any data in order to find the longest common substring.
From the Longest Common Substring Problem:
function LCSubstr(S[1..m], T[1..n])
L := array(1..m, 1..n)
z := 0
ret := {}
for i := 1..m
for j := 1..n
if S[i] = T[j]
if i = 1 or j = 1
L[i,j] := 1
else
L[i,j] := L[i-1,j-1] + 1
if L[i,j] > z
z := L[i,j]
ret := {}
if L[i,j] = z
ret := ret ∪ {S[i-z+1..i]}
return ret
You don't need to sort anything, you have only to parse once your 100MB data, and buid an n*m array of chars to store your computing. Also check this page
LE: Rabin-Karp is a pattern matching algorithm, you don't need it here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With