I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.
This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with >=k descendants, and finally find the deepest of those nodes. Theory-wise I'm set.
I've gotten a lot of help from the following resources (oops, I can only post 2):
I can get the paths from the root to each leaf, but I can't figure out how to pre-process the tree in such a way that I can get the number of descendants from each node. I have a separate algorithm that works on small sequences but it's in exponential complexity, so for larger stuff it takes way too long. I know with a DFS I should be able to perform the whole task in linear complexity. For this algorithm to work I need to be able to get the longest k-peat of an ~40,000 length string in less than 5 minutes.
Here's some sample data (first line: sequence, second line: k, suffix table format: parent child location length):
CATACATAC$
2
1 2 1 1
1 7 2 1
1 14 3 3
1 17 10 1
2 3 2 4
2 6 10 1
3 4 6 5
3 5 10 1
7 8 3 3
7 11 5 1
8 9 6 5
8 10 10 1
11 12 6 5
11 13 10 1
14 15 6 5
14 16 10 1
The output from this should be CATAC.
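One way to read this format is sketched below. The function name `parse_lrep` and the choice to return each edge as a `(parent, child, label)` tuple are assumptions made for illustration; the only thing taken from the sample is that `location` is 1-based. For very long sequences it may be better to keep `(location, length)` instead of slicing out every label.

```python
def parse_lrep(text):
    """Split LREP-style input into (sequence, k, edges).

    Hypothetical helper: each edge becomes (parent, child, label).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    seq = lines[0]
    k = int(lines[1])
    edges = []
    for ln in lines[2:]:
        parent, child, location, length = (int(x) for x in ln.split())
        # location is 1-based, so shift it by one for Python slicing
        label = seq[location - 1:location - 1 + length]
        edges.append((parent, child, label))
    return seq, k, edges


# First few rows of the sample above
sample = """CATACATAC$
2
1 2 1 1
1 7 2 1
2 3 2 4
"""
seq, k, edges = parse_lrep(sample)
print(edges)  # [(1, 2, 'C'), (1, 7, 'A'), (2, 3, 'ATAC')]
```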
With the following code (modified from LiteratePrograms) I've been able to get the paths, but it still takes a long time on longer sequences to parse out a path for each node.
# authors listed at
# http://en.literateprograms.org/Depth-first_search_(Python)?action=history&offset=20081013235803
class Vertex:
    def __init__(self, data):
        self.data = data
        self.successors = []

def depthFirstSearch(start, isGoal, result):
    if start in result:
        return False
    result.append(start)
    if isGoal(start):
        return True
    for v in start.successors:
        if depthFirstSearch(v, isGoal, result):
            return True
    # No path was found
    result.pop()
    return False

def lrep(seq, reps, tree):
    n = 2 * len(seq) - 1
    v = [Vertex(i) for i in range(n)]
    edges = [(int(x[0]), int(x[1])) for x in tree]
    for a, b in edges:
        v[a].successors.append(v[b])
    paths = {}
    # one full search from the root (node 1) per vertex
    for x in v:
        result = []
        paths[x.data] = []
        if depthFirstSearch(v[1], (lambda v: v.data == x.data), result):
            path = [u.data for u in result]
            paths[x.data] = path
What I'd like to do is pre-process the tree to find nodes which satisfy the descendants >= k requirement prior to finding the depth. I haven't even gotten to how I'm going to calculate depth yet, though I imagine I'll have some dictionary that keeps track of the depth of each node in the path and then sums them.
So, my first-most-important question is: "How do I preprocess the tree with descendant leaves?"
My second-less-important question is: "After that, how can I quickly compute depth?"
P.S. I should state that this isn't homework or anything of that sort. I'm just a biochemist trying to expand my horizons with some computational challenges.
Nice question for an exercise in basic string operations. I didn't remember the suffix tree anymore ;) But as you have stated: theory-wise, you are set.
The Wikipedia stub on this topic is a bit confusing. You only need to know whether you are the outermost non-leaf node with n >= k children. If you found the substring from the root node to this one in the whole string, the suffix tree tells you that there are n possible continuations. So there must be n places where this string occurs.
A simple key concept of this and many similar problems is to do a depth-first search: in every node, ask the child elements for their values and return the maximum of them to the parent. The root node will get the final result.
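The "ask the children, combine, hand the result to the parent" idea can also be used for the annotation the question asks about, the number of descendant leaves per node. The sketch below is one possible way to do it, not part of the original answer: the name `count_leaves` is made up, the root is assumed to be node 1, and an explicit stack is used instead of recursion so that deep trees do not run into Python's recursion limit. Counting descendant leaves rather than direct children matters once k > 2, because a node with only two children can still have many leaves below it.

```python
def count_leaves(pairs, root=1):
    """Return {node: number of leaves in its subtree}.

    pairs is an iterable of (parent, child) tuples (hypothetical input shape).
    """
    children = {}
    for parent, child in pairs:
        children.setdefault(parent, []).append(child)

    # Pre-order with an explicit stack: a parent is always listed before its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Walking that order backwards handles every child before its parent.
    leaves = {}
    for node in reversed(order):
        kids = children.get(node, [])
        leaves[node] = sum(leaves[c] for c in kids) if kids else 1
    return leaves


# (parent, child) columns of the sample suffix tree for CATACATAC$
pairs = [(1, 2), (1, 7), (1, 14), (1, 17), (2, 3), (2, 6), (3, 4), (3, 5),
         (7, 8), (7, 11), (8, 9), (8, 10), (11, 12), (11, 13), (14, 15), (14, 16)]
leaves = count_leaves(pairs)
print(sorted(n for n, c in leaves.items() if c >= 2))  # [1, 2, 3, 7, 8, 11, 14]
```

Each node is pushed and popped once, so the annotation takes time linear in the number of edges.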
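How the values are calculated differs between the problems. Here you have three possibilities for every node:
1. The node has no children. It is a leaf and the result is invalid.
2. Every child returns an invalid result. This is the last non-leaf node and the result is zero (no more characters follow this node). If this node has n children, the concatenated string of every edge from the root to this node appears n times in the whole string. If we need at least k occurrences and k > n, the result is also invalid.
3. One or more children return a valid result. The result is the maximum over those children of the returned value plus the length of the string on the edge to that child.
Of course, you also have to return the corresponding node. Otherwise you will know how long the longest repeated substring is but not where it is.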
You should try to code this by yourself first. Constructing the tree is simple but not trivial if you want to gather all the necessary information. Nevertheless, here is a simple example. Please note: all sanity checking is left out and everything will fail horribly if the input is somehow invalid. E.g. do not try to use any root index other than one, do not refer to a node as a parent if it wasn't referenced as a child before, etc. Much room for improvement *hint;)*.
import sys

class Node(object):
    def __init__(self, idx):
        self.idx = idx      # not needed but nice for prints
        self.parent = None  # edge to parent or None
        self.childs = []    # list of edges

    def get_deepest(self, k=2):
        max_value = -1
        max_node = None
        for edge in self.childs:
            r = edge.n2.get_deepest(k)  # pass k down to the children
            if r is None:
                continue  # leaf, or nothing valid below this child
            value, node = r
            value += len(edge.s)
            if value > max_value:  # new best result
                max_value = value
                max_node = node
        if max_node is None:
            # we are either a leaf (no edge connected) or
            # the last non-leaf.
            # The number of childs has to be at least k to be valid.
            return (0, self) if len(self.childs) >= k else None
        else:
            return (max_value, max_node)

    def get_string_to_root(self):
        if self.parent is None:
            return ""
        return self.parent.n1.get_string_to_root() + self.parent.s

class Edge(object):
    # creating the edge also sets the corresponding
    # values in the nodes
    def __init__(self, n1, n2, s):
        # print("Edge %d -> %d [ %s]" % (n1.idx, n2.idx, s))
        self.n1, self.n2, self.s = n1, n2, s
        n1.childs.append(self)
        n2.parent = self

nodes = {1: Node(1)}  # root node
string = sys.stdin.readline().strip()
k = int(sys.stdin.readline())
for line in sys.stdin:
    parent_idx, child_idx, start, length = [int(x) for x in line.split()]
    s = string[start-1:start-1+length]  # start is 1-based
    # every edge constructs a Node
    nodes[child_idx] = Node(child_idx)
    Edge(nodes[parent_idx], nodes[child_idx], s)

(depth, node) = nodes[1].get_deepest(k)
print(node.get_string_to_root())
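For inputs of around 40,000 characters, a recursive traversal may hit Python's default recursion limit, and storing a substring on every edge can use a lot of memory. The following is a non-recursive sketch that addresses both. It is a variation, not the code above: it counts descendant leaves (as described in the question) instead of direct children, and it keeps only `(start, length)` per edge. The function name `longest_k_repeat`, the `rows` tuple layout, and the assumption that k >= 2 and that the root is node 1 are all choices made for illustration.

```python
def longest_k_repeat(seq, k, rows, root=1):
    """Longest substring of seq occurring at least k times (assumes k >= 2).

    rows holds (parent, child, location, length) tuples, location 1-based.
    """
    children, parent_of, span = {}, {}, {}
    for parent, child, location, length in rows:
        children.setdefault(parent, []).append(child)
        parent_of[child] = parent
        span[child] = (location - 1, length)  # keep indices, not substrings

    # Top-down: string depth is the number of characters from the root.
    depth, order, stack = {root: 0}, [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for child in children.get(node, []):
            depth[child] = depth[node] + span[child][1]
            stack.append(child)

    # Bottom-up: the reversed pre-order sees every child before its parent.
    leaves, best = {}, root
    for node in reversed(order):
        kids = children.get(node, [])
        leaves[node] = sum(leaves[c] for c in kids) if kids else 1
        if leaves[node] >= k and depth[node] > depth[best]:
            best = node

    # Rebuild the label by walking parent links back to the root.
    parts = []
    while best != root:
        start, length = span[best]
        parts.append(seq[start:start + length])
        best = parent_of[best]
    return "".join(reversed(parts))


SAMPLE = """CATACATAC$
2
1 2 1 1
1 7 2 1
1 14 3 3
1 17 10 1
2 3 2 4
2 6 10 1
3 4 6 5
3 5 10 1
7 8 3 3
7 11 5 1
8 9 6 5
8 10 10 1
11 12 6 5
11 13 10 1
14 15 6 5
14 16 10 1
"""
lines = SAMPLE.split("\n")
seq, k = lines[0], int(lines[1])
rows = [tuple(int(x) for x in ln.split()) for ln in lines[2:] if ln.strip()]
result = longest_k_repeat(seq, k, rows)
print(result)  # CATAC
```

Both passes touch each node once, so the running time is linear in the size of the tree. If several substrings share the maximum length, which one is returned depends on traversal order.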