
Longest repeated (k times) substring

I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.

This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix tree with the number of descendant leaves below each node, then find the nodes with >= k descendant leaves, and finally find the deepest of those nodes. Theory-wise I'm set.

I've gotten a lot of help from the following resources (oops, I can only post 2):

  • Find longest repetitive sequence in a string
  • Depth-first search (Python)

I can get the paths from the root to each leaf, but I can't figure out how to pre-process the tree in such a way that I can get the number of descendants from each node. I have a separate algorithm that works on small sequences but it's in exponential complexity, so for larger stuff it takes way too long. I know with a DFS I should be able to perform the whole task in linear complexity. For this algorithm to work I need to be able to get the longest k-peat of an ~40,000 length string in less than 5 minutes.

Here's some sample data (first line: sequence, second line: k, suffix table format: parent child location length):

CATACATAC$
2
1 2 1 1
1 7 2 1
1 14 3 3
1 17 10 1
2 3 2 4
2 6 10 1
3 4 6 5
3 5 10 1
7 8 3 3
7 11 5 1
8 9 6 5
8 10 10 1
11 12 6 5
11 13 10 1
14 15 6 5
14 16 10 1

The output from this should be CATAC.

With the following code (modified from LiteratePrograms) I've been able to get the paths, but it still takes a long time on longer sequences to parse out a path for each node.

#authors listed at
#http://en.literateprograms.org/Depth-first_search_(Python)?action=history&offset=20081013235803
class Vertex:
    def __init__(self, data):
        self.data = data
        self.successors = []

def depthFirstSearch(start, isGoal, result):
    if start in result:
        return False

    result.append(start)

    if isGoal(start):
        return True
    for v in start.successors:
        if depthFirstSearch(v, isGoal, result):
            return True

    # No path was found
    result.pop()
    return False

def lrep(seq,reps,tree):
    n = 2 * len(seq) - 1
    v = [Vertex(i) for i in xrange(n)]
    edges = [(int(x[0]),int(x[1])) for x in tree]
    for a, b in edges:
        v[a].successors.append(v[b])

    paths = {}
    for x in v:
        result = []
        paths[x.data] = []
        if depthFirstSearch(v[1], (lambda v: v.data == x.data), result):
            path = [u.data for u in result]
            paths[x.data] = path

What I'd like to do is pre-process the tree to find the nodes that satisfy the descendants >= k requirement before finding the depth. I haven't even gotten to how I'm going to calculate depth yet, though I imagine I'll have some dictionary that keeps track of the depth of each node in the path and then sums them.

So, my first-most-important question is: "How do I preprocess the tree with descendant leaves?"

My second-less-important question is: "After that, how can I quickly compute depth?"

P.S. I should state that this isn't homework or anything of that sort. I'm just a biochemist trying to expand my horizons with some computational challenges.

Gambrinus asked Nov 09 '12 15:11


1 Answer

Nice question for an exercise in basic string operations. I didn't remember the suffix tree anymore ;) But as you have stated: theory-wise, you are set.

How do I preprocess the tree with descendant leaves?

The Wikipedia stub on this topic is a bit confusing. You only need to know whether you are the deepest non-leaf node with n >= k children. If the substring spelled out from the root node to this node occurs in the whole string, the suffix tree tells you that there are n possible continuations. So there must be at least n places where this string occurs. (Strictly, the number of occurrences equals the number of leaves below the node, which can be larger than n.)
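
For the descendant-leaf counts themselves, one option is a single bottom-up pass over the edge table. This is only a sketch under assumptions: `count_leaves` is a hypothetical helper, and the rows are taken to be `(parent, child, location, length)` tuples as in the question's sample. It uses an explicit stack instead of recursion, because a suffix tree for a long, repetitive string can be deeper than Python's default recursion limit.

```python
def count_leaves(rows, root=1):
    """Return {node index: number of leaves at or below that node}.

    Hypothetical helper; `rows` holds (parent, child, location, length)
    tuples, and only the first two fields are used here.
    """
    children = {}
    for parent, child, _location, _length in rows:
        children.setdefault(parent, []).append(child)

    leaves = {}
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaves[node] = 1  # a leaf counts itself
        elif children_done:
            leaves[node] = sum(leaves[c] for c in kids)
        else:
            # Revisit this node once all of its children are counted.
            stack.append((node, True))
            stack.extend((c, False) for c in kids)
    return leaves


# Toy tree (not from the question): 1 -> 2, 1 -> 3, 2 -> 4, 2 -> 5
toy = [(1, 2, 0, 0), (1, 3, 0, 0), (2, 4, 0, 0), (2, 5, 0, 0)]
print(count_leaves(toy)[2])  # number of leaves below node 2
```

Every node whose count is at least k is a candidate; what remains is to pick the candidate with the longest path label.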

After that, how can I quickly compute depth?

A key concept for this and many similar problems is depth-first search: in every node, ask the child elements for their value and return the maximum to the parent. The root node will get the final result.

How the values are calculated differs between problems. Here you have three possibilities for every node:

  1. The node has no children. It's a leaf node; the result is invalid.
  2. Every child returns an invalid result. It's the last non-leaf node; the result is zero (no more characters after this node). If this node has n children, the concatenated string of every edge from the root to this node appears at least n times in the whole string. If we need at least k occurrences and k > n, the result is also invalid.
  3. One or more children return something valid. The result is the maximum over those children of the returned value plus the length of the string attached to the edge leading to that child.

Of course, you also have to return the corresponding node. Otherwise you will know how long the longest repeated substring is, but not where it is.
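
The three cases above can also be sketched without recursion, using leaf counts and string depths. This is not the answerer's code: `longest_k_repeat` is a hypothetical function, it assumes k >= 2 and a root index of 1, and it counts descendant leaves rather than direct children so that it still works when k is larger than the number of children a node can have.

```python
def longest_k_repeat(seq, k, rows, root=1):
    """Return the longest substring of seq that occurs at least k times.

    Hypothetical sketch; `rows` holds (parent, child, location, length)
    tuples with 1-based locations, as in the question's sample data.
    """
    children, parent_edge = {}, {}
    for parent, child, location, length in rows:
        children.setdefault(parent, []).append(child)
        parent_edge[child] = (parent, seq[location - 1:location - 1 + length])

    # Pass 1 (top-down): string depth of every node, in pre-order.
    depth = {root: 0}
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for child in children.get(node, []):
            depth[child] = depth[node] + len(parent_edge[child][1])
            stack.append(child)

    # Pass 2 (bottom-up): reversed pre-order visits children before parents.
    leaves = {}
    best = root
    for node in reversed(order):
        kids = children.get(node, [])
        leaves[node] = sum(leaves[c] for c in kids) if kids else 1
        if leaves[node] >= k and depth[node] > depth[best]:
            best = node

    # Spell out the path label by walking from the best node up to the root.
    parts = []
    node = best
    while node != root:
        node, label = parent_edge[node]
        parts.append(label)
    return "".join(reversed(parts))


SAMPLE = """1 2 1 1\n1 7 2 1\n1 14 3 3\n1 17 10 1\n2 3 2 4\n2 6 10 1
3 4 6 5\n3 5 10 1\n7 8 3 3\n7 11 5 1\n8 9 6 5\n8 10 10 1
11 12 6 5\n11 13 10 1\n14 15 6 5\n14 16 10 1"""
rows = [tuple(int(x) for x in line.split()) for line in SAMPLE.splitlines()]
print(longest_k_repeat("CATACATAC$", 2, rows))  # CATAC
```

Both passes touch each node once, so the work is linear in the size of the tree. If no node qualifies, the sketch returns an empty string.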

Code

You should try to code this yourself first. Constructing the tree is simple but not trivial if you want to gather all the necessary information. Nevertheless, here is a simple example. Please note: all sanity checking is left out, and everything will fail horribly if the input is somehow invalid. E.g. do not try to use any root index other than one, do not refer to nodes as a parent that weren't referenced as a child before, etc. Much room for improvement *hint;)*.

class Node(object):
    def __init__(self, idx):
        self.idx = idx     # not needed but nice for prints 
        self.parent = None # edge to parent or None
        self.childs = []   # list of edges

    def get_deepest(self, k = 2):
        max_value = -1
        max_node = None
        for edge in self.childs:
            r = edge.n2.get_deepest(k)
            if r is None: continue # leaf
            value, node = r
            value += len(edge.s)
            if value > max_value: # new best result
                max_value = value
                max_node = node
        if max_node is None:
            # we are either a leaf (no edge connected) or
            # the last non-leaf.
            # The number of childs has to be at least k to be valid.
            return (0, self) if len(self.childs) >= k else None
        else:
            return (max_value, max_node)

    def get_string_to_root(self):
        if self.parent is None: return "" 
        return self.parent.n1.get_string_to_root() + self.parent.s

class Edge(object):
    # creating the edge also sets the corresponding
    # values in the nodes
    def __init__(self, n1, n2, s):
        #print "Edge %d -> %d [ %s]" % (n1.idx, n2.idx, s)
        self.n1, self.n2, self.s = n1, n2, s
        n1.childs.append(self)
        n2.parent = self

import sys

nodes = {1 : Node(1)} # root-node
string = sys.stdin.readline().strip() # drop the trailing newline
k = int(sys.stdin.readline())
for line in sys.stdin:
    parent_idx, child_idx, start, length = [int(x) for x in line.split()]
    s = string[start-1:start-1+length]
    # every edge constructs a Node
    nodes[child_idx] = Node(child_idx)
    Edge(nodes[parent_idx], nodes[child_idx], s)

(depth, node) = nodes[1].get_deepest(k)
print(node.get_string_to_root())
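
The ordering caveat mentioned before the code (a parent must already exist when an edge refers to it, and the root must be index 1) can be lifted by keying plain dictionaries on node index. A minimal sketch, assuming the same `parent child location length` rows; `build_children` is a hypothetical helper, not part of the code above.

```python
def build_children(seq, rows):
    """Return (root index, {parent: [(child, edge label), ...]}).

    Hypothetical helper: entries are created on demand, so rows may
    arrive in any order and any index may turn out to be the root.
    """
    children = {}
    has_parent = set()
    for parent, child, location, length in rows:
        if child in has_parent:
            raise ValueError("node %d has more than one parent" % child)
        has_parent.add(child)
        label = seq[location - 1:location - 1 + length]
        children.setdefault(parent, []).append((child, label))

    # The root is the only node that appears as a parent but never as a child.
    roots = set(children) - has_parent
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots.pop(), children


# Rows deliberately out of order: the edge 2 -> 3 comes before 1 -> 2.
root, children = build_children("ABC", [(2, 3, 2, 1), (1, 2, 1, 1), (1, 4, 3, 1)])
print(root)
```

The two checks are cheap and catch the most common ways a hand-edited edge table goes wrong.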
Peter Schneider answered Sep 21 '22 00:09