Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to remove repeated character pattern in a string

Tags:

python

regex

I have a string that may have a repeated character pattern, e.g.

'xyzzyxxyzzyxxyzzyx'

I need to write a regex that would replace such string with its smallest repeated pattern:

'xyzzyxxyzzyxxyzzyx' becomes 'xyzzyx',

'abcbaccbaabcbaccbaabcbaccba' becomes 'abcbaccba'
like image 305
mercador Avatar asked Sep 17 '12 23:09

mercador


3 Answers

Use the following:

> re.sub(r'(.+?)\1+', r'\1', 'xyzzyxxyzzyxxyzzyx')
'xyzzyx'
> re.sub(r'(.+?)\1+', r'\1', 'abcbaccbaabcbaccbaabcbaccba')
'abcbaccba'
> re.sub(r'(.+?)\1+', r'\1', 'iiiiiiiiiiiiiiiiii')
'i'

It basically matches a pattern that repeats itself (.+?)\1+, and removes everything but the repeating pattern, which is captured in the first group \1. Also note that using a reluctant qualifier here, i.e., +? will make the regex backtrack quite a lot.

DEMO.

like image 107
João Silva Avatar answered Oct 13 '22 21:10

João Silva


Since you want the smallest repeating pattern, something like the following should work for you:

re.sub(r'^(.+?)\1+$', r'\1', input_string)

The ^ and $ anchors make sure you don't get matches in the middle of the string, and by using .+? instead of just .+ you will get the shortest pattern (compare results using a string like 'aaaaaaaaaa').

like image 20
Andrew Clark Avatar answered Oct 13 '22 21:10

Andrew Clark


Try this regex pattern and capture the first group:

^(.+?)\1+$
  • ^ anchor for beginning of string/line
  • . any character except newlines
  • + quantifier to denote atleast 1 occurence
  • ? makes the + lazy instead of greedy, hence giving you the shortest pattern
  • () capturing group
  • \1+ backreference with quantifier to denote that pattern should repeat atleast once
  • $ anchor for end of string/line

Test it here: Rubular


The above solution does a lot of backtracking affecting performance. If you know the which characters are not allowed in these strings, then you can use a negated characted set which eliminates backtracking. For e.g., if whitespaces are not allowed, then

^([^\s]+)\1+$
like image 44
Kash Avatar answered Oct 13 '22 20:10

Kash