Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available.

When overlapping is allowed, the answer is trivial (deepest parent node in suffix tree).

For example for String = "acaca"

If overlapping is allowed, the answer is "aca" but when overlapping is not allowed, the answer is "ac" or "ca".

I need the algorithm or high level idea only.

P.S.: I tried but there is no clear answer I can find on web.

like image 966
Genuine Programmer Avatar asked Sep 30 '12 03:09

Genuine Programmer


People also ask

How do you find the longest repeated substring in a string?

Suppose we have a string S, we have to find the length of the longest repeating substring(s). We will return 0 if no repeating substring is present. So if the string is like “abbaba”, then the output will be 2. As the longest repeating substring is “ab” or “ba”.

What is non overlapping substring?

The substrings do not overlap, that is for any two substrings s[i.. j] and s[x..y] , either j < x or i > y is true. A substring that contains a certain character c must also contain all occurrences of c .

How does Ukkonen's algorithm work?

The algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string, adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property.


2 Answers

Generate suffix array and sort in O(nlogn).ps: There is more effective algorithm like DC3 and Ukkonen algorithm. example:

String : ababc Suffix array: start-index of substring | substring
0 - ababc
2 - abc
1 - babc
3 - bc
4 - c


compare each two consecutive substring and get common prefix with following constraint:
Say i1 is index of substring "ababc": 0
say i2 is index of substring "abc":2
common prefix is "ab" , length of common prefix is len


abs(i1-i2) >len //avoid overlap


go through suffix array with solution, and you will get the result of "ababc", which is "ab";

The whole solution will run O(nlogn)

However, there will be a special case: "aaaaa" that this solution can't solve thoroughly.
Welcome to discuss and come up to a solution of O(nlogn) but not O(n^2)

like image 87
Chris Su Avatar answered Sep 28 '22 07:09

Chris Su


Unfortunately, the solution proposed by Perkins will not work. We can't brute force our way through solutions to find a long repeated non-overlapping substring. Consider the suffix tree for banana: http://en.wikipedia.org/wiki/Suffix_tree. The "NA" branching node with "A" as its parent will be considered first, since it has the biggest length and is a branching node. But its constructed string "ANA" is overlapping, so it will be rejected. Now, the next node to consider with be "NA" which will show a non-overlapping length of 2, but substring "AN" will never be considered since it was already represented in the ANA string already considered. So if you're searching for all repeated non-overlapping substrings, or when there's a tie you want the first alphabetical one, you're out of luck.

Apparently there is an approach involving suffix trees that works, but the simpler approach is laid out here: http://rubyquiz.com/quiz153.html

Hope this helps!

like image 28
Adam Griffis Avatar answered Sep 28 '22 07:09

Adam Griffis