I have a string that may have a repeated character pattern, e.g.
'xyzzyxxyzzyxxyzzyx'
I need to write a regex that would replace such string with its smallest repeated pattern:
'xyzzyxxyzzyxxyzzyx' becomes 'xyzzyx',
'abcbaccbaabcbaccbaabcbaccba' becomes 'abcbaccba'
Use the following:
> re.sub(r'(.+?)\1+', r'\1', 'xyzzyxxyzzyxxyzzyx')
'xyzzyx'
> re.sub(r'(.+?)\1+', r'\1', 'abcbaccbaabcbaccbaabcbaccba')
'abcbaccba'
> re.sub(r'(.+?)\1+', r'\1', 'iiiiiiiiiiiiiiiiii')
'i'
It basically matches a pattern that repeats itself (.+?)\1+
, and removes everything but the repeating pattern, which is captured in the first group \1
. Also note that using a reluctant qualifier here, i.e., +?
will make the regex backtrack quite a lot.
DEMO.
Since you want the smallest repeating pattern, something like the following should work for you:
re.sub(r'^(.+?)\1+$', r'\1', input_string)
The ^
and $
anchors make sure you don't get matches in the middle of the string, and by using .+?
instead of just .+
you will get the shortest pattern (compare results using a string like 'aaaaaaaaaa'
).
Try this regex pattern and capture the first group:
^(.+?)\1+$
^
anchor for beginning of string/line.
any character except newlines+
quantifier to denote atleast 1 occurence?
makes the +
lazy instead of greedy, hence giving you the shortest pattern()
capturing group\1+
backreference with quantifier to denote that pattern should
repeat atleast once$
anchor for end of string/lineTest it here: Rubular
The above solution does a lot of backtracking affecting performance. If you know the which characters are not allowed in these strings, then you can use a negated characted set which eliminates backtracking. For e.g., if whitespaces are not allowed, then
^([^\s]+)\1+$
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With