Finding repetitions in a string

Question

For two days I have been researching for this and have not found anything so I decided to write my own string repetition detector. Basically the function

def findRepetitions (string):

would receive a string and search for any repetitions; returns a list of strings reduced to their simplest form.

For a sample, it'd be:

findRepetitions ("trololololo") --> ["olo"]
findRepetitions ("bookkeeper") ---> ["o", "k", "e"]
findRepetitions ("Hello, Molly") -> ["l", "l"]
findRepetitions ("abcdefgh") -----> []
findRepetitions ("102102102") ----> ["102"]

In the third example, the function returns ["l", "l"] instead of ["ll"], because I want to search for repetitions only in the neighboring characters.

I know that this may be hard, but I've been literally thinking over this for a long time and cannot find any smart solution to this.

igon · Accepted Answer

This is a well known problem:

http://en.wikipedia.org/wiki/Longest_repeated_substring_problem

You can solve this problem efficiently but constructing a trie:

http://en.wikipedia.org/wiki/Radix_tree

The wiki page shows pseudocode and examples for lookup and addition of nodes which are the only functions that you need. Insert strings in the trie starting at every character, e.g. for string abcd insert abcd, bcd, cd, d. This particular instance of a trie is called a "suffix tree":

http://en.wikipedia.org/wiki/Suffix_tree

Every time you traverse an already estabilished path you are in fact discovering a repetition in your string. Now, you can list all the repetions in a separate data structure and extract the longest one (if necessary).

kreativitea · Answer

Your examples are inconsistent. For example, olo does not repeat, like the l in Hello, Molly, in `trololololo; there's an l between instances. Sequential repeats in trololololo are lolo, lo, olol, and ol. Are you asking for a 'greedy' algorithm? So, given trololololo, it would return olol?

In any case, here's a bit of code.

from collections import Counter

def find_repetition(p):
    """ Returns a lookup dictionary for repetitions. """ 
    lookup = Counter()
    while len(p) != 0:
        for i in xrange(len(p)):
            lookup[p[0:i]] += 1
        p = p[1:]
    return lookup

def repeats(p):
    a = find_repetition(p)
    rs = [i for i in a if a[i] > 1][1:]
    return [r for r in rs if r*2 in p]

If you want it to be 'greedy' like I described, you have to add in another function that takes the results from repeats and chomps away at your string when it finds a match.

For now, the results look like this:

test = "trololololo", "bookkeeper", "Hello, Molly", "abcdefgh", "102102102"

>>> for i in test:
>>>     repeats(i)

['lolo', 'lo', 'olol', 'ol']
['e', 'o', 'k']
['l']
[]
['210', '021', '102']

warning

find_repetition is not very quick, since it basically generates all length combinations of the string and throws them into a Counter object.

Finding repetitions in a string

Tags:

python

string

Berk Özbalcı

2 Answers

igon

kreativitea

Recent Activity

Donate For Us

Finding repetitions in a string

Tags:

python

string

Berk Özbalcı

2 Answers

igon

kreativitea

Related questions

Recent Activity

Donate For Us