Shortest Repeating Sub-String

Tags:

python

string-matching

regex

I am looking for an efficient way to extract the shortest repeating substring. For example:

input1 = 'dabcdbcdbcdd'
output1 = 'bcd'

input2 = 'cbabababac'
output2 = 'ba'

I would appreciate any answer or information related to the problem.

Also, in this post, people suggest that we can use a regular expression like

re=^(.*?)\1+$

to find the smallest repeating pattern in the string. But this expression does not work in Python and always returns a non-match (I am new to Python, so perhaps I am missing something?).

--- follow up ---

Here the criterion is to look for the shortest non-overlapping pattern whose length is greater than one and which has the longest overall length.

asked Dec 26 '11 08:12

TimC

1 Answer

A quick fix for this pattern could be

(.+?)\1+

Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc but not xabcabcabcx. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+? instead of .*?.
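To make the two points above concrete, here is a minimal sketch comparing the anchored pattern from the question with the unanchored one. The helper `repeated_unit` and the variable names are only illustrative and do not appear in the original post.

```python
import re

anchored = re.compile(r"^(.*?)\1+$")    # pattern from the question
unanchored = re.compile(r"(.+?)\1+")    # quick fix: no anchors, at least one character
zero_length = re.compile(r"(.*?)\1+")   # no anchors, but still allows an empty group


def repeated_unit(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


# The anchored pattern only matches when the WHOLE string is one repeated unit.
print(repeated_unit(anchored, "abcabcabc"))       # abc
print(repeated_unit(anchored, "xabcabcabcx"))     # None
print(repeated_unit(anchored, "dabcdbcdbcdd"))    # None

# Without the anchors, the repeat may sit anywhere in the string.
print(repeated_unit(unanchored, "xabcabcabcx"))   # abc

# With .*? the empty string counts as a "repeat", so any input matches.
print(repr(repeated_unit(zero_length, "xyz")))    # ''
```

The last call shows why `.+?` is needed: an empty group trivially repeats itself, so the match says nothing about the input.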

In Python:

>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']

But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d will not be found although that is the shortest repeating string. Or see this example (here it can't find abcd because the abc part of the first abcd has already been used up by the first match):

>>> r.findall("abcabcdabcd")
['abc']
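One way to see what "used up" means is to look at the span of each match rather than only the captured group. This is a small sketch using `finditer`; the variable names are illustrative.

```python
import re

r = re.compile(r"(.+?)\1+")
text = "abcabcdabcd"

# Each entry: (span of the whole match, whole match, captured unit).
matches = [(m.span(), m.group(0), m.group(1)) for m in r.finditer(text)]
print(matches)                # [((0, 6), 'abcabc', 'abc')]

# The single match covers indices 0..5, so scanning resumes at index 6.
# The repeat "abcdabcd" starts at index 3, inside the consumed part:
print(text[3:])               # abcdabcd
print(r.findall(text[3:]))    # ['abcd']
```

Searching the tail `text[3:]` on its own does find `abcd`, which confirms that the repeat is only hidden by the earlier match and not absent from the string.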

Also, it may return several matches, so you'd need to find the shortest one in a second step:

>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']

Better solution:

To allow the engine to also find overlapping matches, use

(.+?)(?=\1)

This will find some strings more than once if they are repeated enough times. Because the lookahead does not consume the repeated copy, repeats that start inside that copy can now be found as well:

>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
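If the duplicates get in the way, they can be collapsed while keeping the order of first appearance. The sketch below also records where each hit starts; the names `hits` and `unique` are only illustrative.

```python
import re

r = re.compile(r"(.+?)(?=\1)")
text = "dabcdbcdbcdd"

# Start index and captured unit of every hit.
hits = [(m.start(), m.group(1)) for m in r.finditer(text)]
print(hits)      # [(2, 'bcd'), (5, 'bcd'), (10, 'd')]

# dict.fromkeys drops duplicates but preserves insertion order.
unique = list(dict.fromkeys(unit for _, unit in hits))
print(unique)    # ['bcd', 'd']
```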

Therefore, you should compare the results by length and return the shortest one:

>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'

The or [""] (thanks to J. F. Sebastian!) ensures that no ValueError is triggered if there's no match at all.
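Putting the pieces together, the lookahead approach could be wrapped in a small helper. This is only a sketch: the function name `shortest_repeat` and its `min_len` parameter are hypothetical, and `min_len=2` is one possible reading of the follow-up requirement that the pattern be longer than one character. It does not attempt the "longest overall length" part of that criterion.

```python
import re


def shortest_repeat(text, min_len=1):
    """Return the shortest immediately repeated substring found by the
    lookahead regex, or "" if there is none.

    min_len is a hypothetical knob: the minimum length of the repeated unit.
    """
    # {n,}? is the lazy "at least n characters" form of +?
    pattern = re.compile(r"(.{%d,}?)(?=\1)" % min_len)
    return min(pattern.findall(text) or [""], key=len)


print(shortest_repeat("dabcdbcdbcdd"))               # d
print(shortest_repeat("dabcdbcdbcdd", min_len=2))    # bcd
print(shortest_repeat("cbabababac", min_len=2))      # ba
print(repr(shortest_repeat("abcdef")))               # ''
```

With `min_len=2` the helper returns the outputs given in the question for both sample inputs. Note that `.` does not match newlines by default, so multi-line input would need the `re.DOTALL` flag.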

answered Sep 21 '22 23:09

Tim Pietzcker
