Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the Longest Common Substring in a Large Data Set

In the past few days I've researched this extensively, I've read so many things that I am now more confused then ever. How does one find the longest common sub string in a large data set? The idea is to remove duplicate content from this data set (of varying lengths, so the algo will need to run continuously). By large data set I mean approximately 100mb of text.

Suffix tree? Suffix array? Rabin-Karp? What's the best way? And is there a library out there that can help me?

Really hoping for a good response, my head hurts a lot. Thank you! :-)

like image 363
diffuse Avatar asked Nov 17 '10 20:11

diffuse


People also ask

How do you find the longest common substring of multiple strings?

The longest common substrings of a set of strings can be found by building a generalized suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it.

What is the time complexity for finding the longest common substring?

The time complexity of finding the longest non-repeated substring is O(n).

How do I find the largest substring in Python?

Python Program To Find Length Of The Longest Substring Without Repeating Characters. Given a string str, find the length of the longest substring without repeating characters. For “ABDEFGABEF”, the longest substring are “BDEFGA” and “DEFGAB”, with length 6. For “BBBB” the longest substring is “B”, with length 1.


1 Answers

I've always been using suffix arrays. Because I've been told always this is the fastest way there.

If you are running out of memory on the machine the algorithm is running, you can always save your array in a file on your hard-drive. It will slow down considerably the algorithm but it will provide the result, alt least.

And I don't think that a library will do a better job than a good written and clean algorithm.

LE: Btw, you don't need to remove any data in order to find the longest common substring.

From the Longest Common Substring Problem:

function LCSubstr(S[1..m], T[1..n])
    L := array(1..m, 1..n)
    z := 0
    ret := {}
    for i := 1..m
        for j := 1..n
            if S[i] = T[j]
                if i = 1 or j = 1
                    L[i,j] := 1
                else
                    L[i,j] := L[i-1,j-1] + 1
                if L[i,j] > z
                    z := L[i,j]
                    ret := {}
                if L[i,j] = z
                    ret := ret ∪ {S[i-z+1..i]}
    return ret

You don't need to sort anything, you have only to parse once your 100MB data, and buid an n*m array of chars to store your computing. Also check this page

LE: Rabin-Karp is a pattern matching algorithm, you don't need it here.

like image 83
sdadffdfd Avatar answered Oct 24 '22 03:10

sdadffdfd