Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

finding long repeated substrings in a massive string

I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for.

I have a really really long string (hundreds of megabytes). I have about 1 GB of RAM.

This is why building a suffix trie with counting data is too inefficient space-wise to work for me. To quote Wikipedia's Suffix tree:

storing a string's suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

And that was wikipedia's comments on the tree, not trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )

like image 822
Will Avatar asked Dec 29 '08 21:12

Will


3 Answers

The effective way to do this is to create an index of the sub-strings, and sort them. This is an O(n lg n) operation.

BWT compression does this step, so its a well understood problem and there are radix and suffix (claim O(n)) sort implementations and such to make it as efficient as possible. It still takes a long time, perhaps several seconds for large texts.

If you want to use utility code, C++ std::stable_sort() performs much better than std::sort() for natural language (and much faster than C's qsort(), but for different reasons).

Then visiting each item to see the length of its common substring with its neighbours is O(n).

like image 118
Will Avatar answered Oct 23 '22 02:10

Will


You could look at disk-based suffix trees. I found this Suffix tree implementation library through Google, plus a bunch of articles that could help implementing it yourself.

like image 34
orip Avatar answered Oct 23 '22 03:10

orip


You could solve this using divide and conquer. I think this should be the same algorithmic complexity as using a trie, but maybe less efficient implementation-wise

void LongSubstrings(string data, string prefix, IEnumerable<int> positions)
{
    Dictionary<char, DiskBackedBuffer> buffers = new Dictionary<char, DiskBackedBuffer>();
    foreach (int position in positions)
    {
        char nextChar = data[position];
        buffers[nextChar].Add(position+1);
    }

    foreach (char c in buffers.Keys)
    {
        if (buffers[c].Count > 1)
            LongSubstrings(data, prefix + c, buffers[c]);
        else if (buffers[c].Count == 1)
            Console.WriteLine("Unique sequence: {0}", prefix + c);
    }
}

void LongSubstrings(string data)
{
    LongSubstrings(data, "", Enumerable.Range(0, data.Length));
}

After this, you would need to make a class that implemented DiskBackedBuffer such that it was a list of numbers, and when the buffer got to a certain size, it would write itself out to disk using a temporary file, and recall from disk when read from.

like image 42
FryGuy Avatar answered Oct 23 '22 04:10

FryGuy