Finding the Longest Common Substring in a Large Data Set

I've researched this extensively over the past few days, and I've read so many things that I am now more confused than ever. How does one find the longest common substring in a large data set? The idea is to remove duplicate content from this data set (the duplicates are of varying lengths, so the algorithm will need to run continuously). By large data set I mean approximately 100 MB of text.

Suffix tree? Suffix array? Rabin-Karp? What's the best way? And is there a library out there that can help me?

Really hoping for a good response, my head hurts a lot. Thank you! :-)

asked Nov 17 '10 20:11

diffuse

1 Answer

I've always used suffix arrays, because I've always been told this is the fastest approach.
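To make the suffix array idea concrete, here is a minimal Python sketch for two strings. It is an illustration under assumptions, not a tuned implementation: the function name `longest_common_substring` and the `sep` parameter are hypothetical, the separator is assumed not to occur in either input, and the suffix array is built by naive sorting, which is far too slow and memory-hungry for 100 MB.

```python
def longest_common_substring(a: str, b: str, sep: str = "\x00") -> str:
    """Return one longest substring shared by a and b (illustrative sketch)."""
    # Assumption: sep does not occur in either input.
    if sep in a or sep in b:
        raise ValueError("separator must not occur in either input")
    text = a + sep + b

    # Naive suffix array: sort start positions by the suffix beginning there.
    # The sort keys copy every suffix, so this needs quadratic memory;
    # large inputs need a dedicated suffix array construction algorithm.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])

    best_start, best_len = 0, 0
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        # Only neighbours that start in different inputs are of interest.
        if (prev < len(a)) == (cur < len(a)):
            continue
        # Length of the common prefix of the two neighbouring suffixes.
        k = 0
        while (prev + k < len(text) and cur + k < len(text)
               and text[prev + k] == text[cur + k]):
            k += 1
        if k > best_len:
            best_start, best_len = cur, k
    return text[best_start:best_start + best_len]
```

The longest common substring always shows up as the common prefix of two neighbouring suffixes that come from different inputs, which is why a single pass over the sorted array is enough once it has been built.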

If you run out of memory on the machine running the algorithm, you can always store the array in a file on disk. That will slow the algorithm down considerably, but it will at least produce a result.

And I don't think a library will do a better job than a clean, well-written algorithm.

LE: Btw, you don't need to remove any data in order to find the longest common substring.

From the Longest Common Substring Problem:

function LCSubstr(S[1..m], T[1..n])
    L := array(1..m, 1..n)
    z := 0
    ret := {}
    for i := 1..m
        for j := 1..n
            if S[i] = T[j]
                if i = 1 or j = 1
                    L[i,j] := 1
                else
                    L[i,j] := L[i-1,j-1] + 1
                if L[i,j] > z
                    z := L[i,j]
                    ret := {}
                if L[i,j] = z
                    ret := ret ∪ {S[i-z+1..i]}
            else
                L[i,j] := 0
    return ret
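The pseudocode above can be sketched in runnable Python roughly as follows. The name `lcsubstr` is just an illustrative choice, and the sketch departs from the pseudocode in one labeled way: it keeps only the previous row of the table instead of the full m×n array, since each cell depends only on the cell diagonally above it. Running time is still proportional to m×n.

```python
def lcsubstr(s: str, t: str) -> set:
    """Return the set of all longest common substrings of s and t."""
    longest = 0            # z in the pseudocode
    result = set()         # ret in the pseudocode
    # prev holds row i-1 of the table; index 0 is a zero-filled border column
    # that replaces the "i = 1 or j = 1" special case.
    prev = [0] * (len(t) + 1)
    for i, sc in enumerate(s, start=1):
        cur = [0] * (len(t) + 1)   # mismatching cells stay 0
        for j, tc in enumerate(t, start=1):
            if sc == tc:
                cur[j] = prev[j - 1] + 1
                if cur[j] > longest:
                    longest = cur[j]
                    result = set()
                if cur[j] == longest:
                    # s[i-z+1..i] in 1-based notation is s[i-longest:i] here.
                    result.add(s[i - longest:i])
        prev = cur
    return result
```

For example, `lcsubstr("ABAB", "BABA")` yields both `"ABA"` and `"BAB"`, since the two strings share two different substrings of length 3.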

You don't need to sort anything; you only have to go through your 100 MB of data once and build an n*m array of match lengths to store the intermediate results. Also check this page

LE: Rabin-Karp is a pattern-matching algorithm; you don't need it here.

answered Oct 24 '22 03:10

sdadffdfd

Tags: string, algorithm, suffix-tree, large-files
