For two days I have been researching for this and have not found anything so I decided to write my own string repetition detector. Basically the function
def findRepetitions (string):
would receive a string and search for any repetitions; returns a list of strings reduced to their simplest form.
For a sample, it'd be:
findRepetitions ("trololololo") --> ["olo"]
findRepetitions ("bookkeeper") ---> ["o", "k", "e"]
findRepetitions ("Hello, Molly") -> ["l", "l"]
findRepetitions ("abcdefgh") -----> []
findRepetitions ("102102102") ----> ["102"]
In the third example, the function returns ["l", "l"] instead of ["ll"], because I want to search for repetitions only in the neighboring characters.
I know that this may be hard, but I've been literally thinking over this for a long time and cannot find any smart solution to this.
This is a well known problem:
http://en.wikipedia.org/wiki/Longest_repeated_substring_problem
You can solve this problem efficiently but constructing a trie:
http://en.wikipedia.org/wiki/Radix_tree
The wiki page shows pseudocode and examples for lookup and addition of nodes which are the only functions that you need. Insert strings in the trie starting at every character, e.g. for string abcd insert abcd, bcd, cd, d. This particular instance of a trie is called a "suffix tree":
http://en.wikipedia.org/wiki/Suffix_tree
Every time you traverse an already estabilished path you are in fact discovering a repetition in your string. Now, you can list all the repetions in a separate data structure and extract the longest one (if necessary).
Your examples are inconsistent. For example, olo does not repeat, like the l in Hello, Molly, in `trololololo; there's an l between instances. Sequential repeats in trololololo are lolo, lo, olol, and ol. Are you asking for a 'greedy' algorithm? So, given trololololo, it would return olol?
In any case, here's a bit of code.
from collections import Counter
def find_repetition(p):
""" Returns a lookup dictionary for repetitions. """
lookup = Counter()
while len(p) != 0:
for i in xrange(len(p)):
lookup[p[0:i]] += 1
p = p[1:]
return lookup
def repeats(p):
a = find_repetition(p)
rs = [i for i in a if a[i] > 1][1:]
return [r for r in rs if r*2 in p]
If you want it to be 'greedy' like I described, you have to add in another function that takes the results from repeats and chomps away at your string when it finds a match.
For now, the results look like this:
test = "trololololo", "bookkeeper", "Hello, Molly", "abcdefgh", "102102102"
>>> for i in test:
>>> repeats(i)
['lolo', 'lo', 'olol', 'ol']
['e', 'o', 'k']
['l']
[]
['210', '021', '102']
warning
find_repetition is not very quick, since it basically generates all length combinations of the string and throws them into a Counter object.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With