Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find longest repetitive sequence in a string

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value "helloworld" to be returned.

I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.

like image 978
Snesticle Avatar asked Jun 18 '12 20:06

Snesticle


People also ask

How do you find the repeating sequence of a string in Python?

(1) First generate the possible sub-strings you want to search in each string. Is there a min or max length? Build a list or set of sub-strings from the input string. (2) Once you have the sub-strings to search for, try to identify the unique locations within the input string where the substrings appear.


1 Answers

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.

Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.

There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.

Hope this helps!

like image 193
templatetypedef Avatar answered Sep 30 '22 06:09

templatetypedef