Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shortest Repeating Sub-String

I am looking for an efficient way to extract the shortest repeating substring. For example:

input1 = 'dabcdbcdbcdd'
ouput1 = 'bcd'

input2 = 'cbabababac'
output2 = 'ba'

I would appreciate any answer or information related to the problem.

Also, in this post, people suggest that we can use the regular expression like

re=^(.*?)\1+$

to find the smallest repeating pattern in the string. But such expression does not work in Python and always return me a non-match (I am new to Python and perhaps I miss something?).

--- follow up ---

Here the criterion is to look for shortest non-overlap pattern whose length is greater than one and has the longest overall length.

like image 835
TimC Avatar asked Dec 26 '11 08:12

TimC


People also ask

How do you find a repeating substring in Python?

Python has a built-in function for counting the repeated substring in a given string called count(). As the name suggests, it counts the occurrence of a substring in a given string.

Is a substring consecutive?

A substring repeats consecutively if it appears multiple times in a string, with no gaps between its occurrences. For example, the string ``BLAAABADBADX" has two repeating substrings: 'A', which repeats three times, and 'BAD', which repeats twice.


1 Answers

A quick fix for this pattern could be

(.+?)\1+

Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.

In Python:

>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']

But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d will not be found although that is the shortest repeating string. Or see this example: here it can't find abcd because the abc part of the first abcd has been used up in the first match):

>>> r.findall("abcabcdabcd")
['abc']

Also, it may return several matches, so you'd need to find the shortest one in a second step:

>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']

Better solution:

To allow the engine to also find overlapping matches, use

(.+?)(?=\1)

This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:

>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']

Therefore, you should sort the results by length and return the shortest one:

>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'

The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.

like image 55
Tim Pietzcker Avatar answered Sep 21 '22 23:09

Tim Pietzcker