Can regular expressions find repetitions of characters?

Tags:

regex

My users insert sequences like

________________________
************************
------------------------
♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥

to format documents (don't ask me about my users!). It looks bad when displaying snippets. How can I remove repetitions of any character? I can add individual filters, but it will be a constant cat-and-mouse game.

Can a regular expression filter these?

367

asked Oct 12 '11 06:10

Jesvin Jose

3 Answers

Try something like:

(.)\1{5,}

This matches any character followed by 5 or more repetitions of that same character, i.e. a run of at least 6. Remember to escape the \ if your language uses strings for regex patterns!
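
As a minimal sketch of how that pattern could be applied in Python (the function name strip_long_runs and the sample text are made up for illustration), a raw string avoids having to double the backslash:

```python
import re

# Raw string: the backslash in \1 reaches the regex engine unchanged.
# The group captures one character; \1{5,} requires 5+ further copies of it.
RUN_OF_SIX = re.compile(r'(.)\1{5,}')

def strip_long_runs(text):
    """Remove every run of six or more identical characters (hypothetical helper)."""
    return RUN_OF_SIX.sub('', text)

sample = "Title\n________________________\nballoon... ok\n♥♥♥♥♥♥♥♥"
print(strip_long_runs(sample))
```

Short repetitions such as the doubled letters in "balloon" or the three dots of an ellipsis are left alone, because they never reach six in a row. Note that . does not match a newline by default, so runs of blank lines are not affected.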

124

answered Oct 29 '22 23:10

Simon Buchan

You can remove repetitions of any character with a simple regex like (.)\1+

However, this will catch legitimate uses as well, such as words that have doubled letters in their spelling (balloon, spelling, well, etc).

So you'd probably want to restrict the expression to some disallowed characters after all, while keeping it as generic as possible, so that you don't have to modify it every time your users find new characters to use.
One possible solution is to disallow repeated characters that are neither letters nor digits:

([^A-Za-z0-9])\1+

But even this does not cover every case, as some of your users may decide to use runs of letters as delimiters:

ZZZZZZZZZZZZZZZZZZZZZZ
BBBBBBBBBBBBBBBBBBBBBB
ZZZZZZZZZZZZZZZZZZZZZZ

To catch these as well, with the added benefit of still allowing legitimate short repetitions of non-letter characters (such as an ellipsis: ...), you could permit at most 3 consecutive copies of a character. The general syntax is (<pattern>)\1{min,max}; since the group itself already matches the first character, (.)\1{3,} matches offending sequences with a minimum length of 4 and no upper limit.
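
A rough sketch of the run-length idea in Python follows. The helper run_pattern and its parameters are hypothetical, not part of any library. It writes the quantifier as min_run - 1 because the captured character itself counts toward the run. Excluding whitespace from the symbol class (\s) is an extra assumption, made so that indentation is not stripped.

```python
import re

def run_pattern(min_run, char_class='.'):
    # Hypothetical helper: match min_run or more copies of one character.
    # The group matches the first copy, so the backreference has to
    # repeat only min_run - 1 more times.
    return re.compile(r'(%s)\1{%d,}' % (char_class, min_run - 1))

# Any character, runs of 4 or more.
any_char = run_pattern(4)
# Only characters that are not letters, digits, or whitespace; runs of 2 or more.
symbols_only = run_pattern(2, r'[^A-Za-z0-9\s]')

text = "Well... ZZZZZZ balloon -- done"
print(any_char.sub('', text))      # drops ZZZZZZ, keeps ... and --
print(symbols_only.sub('', text))  # drops ... and --, keeps ZZZZZZ
```

The two variants trade off differently: the length threshold tolerates an ellipsis but also removes long letter runs, while the character-class restriction never touches letters but removes even a two-character symbol run.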

answered Oct 29 '22 23:10

luvieere

In python (but the logic is the same regardless of the language):

>>> import re
>>> text = '''
... This is some text
... ________________________
... This some more
... ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
... Truly the last line
... '''
>>> print(re.sub(r'[_♥]{2,}', '', text))  # this is the core (regexp)

This is some text

This some more

Truly the last line

This has the advantage that you have some control over what to substitute and what not (for example, you might not want to substitute . as it could be part of a comment like This is still to do...).

EDIT:

If your repetitions are always "lines" you could add the newline characters to your expression:

text = '''
This is some text
________________________
This some more
♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥
Truly the last line
But this is not to be changed: ♥♥♥
'''
>>> print(re.sub(r'\n[_♥]{2,}\n', '\n', text))
This is some text
This some more
Truly the last line
But this is not to be changed: ♥♥♥
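
One caveat with anchoring on \n at both ends: when two separator lines follow each other directly, the first match consumes the newline that the second one needs, so the second line survives. A possible variant, sketched here under the assumption that separators consist only of the characters _ ♥ * - (the name drop_separator_lines is made up), uses re.MULTILINE so that ^ and $ match at line boundaries instead:

```python
import re

# ^ and $ match at the start and end of each line under re.MULTILINE.
# The line must consist solely of 2+ separator characters (plus optional
# trailing spaces or tabs); the trailing newline is removed along with it.
SEPARATOR_LINE = re.compile(r'^[_♥*-]{2,}[ \t]*$\n?', re.MULTILINE)

def drop_separator_lines(text):
    """Delete whole lines made up only of separator characters."""
    return SEPARATOR_LINE.sub('', text)

text = (
    "This is some text\n"
    "________________________\n"
    "************************\n"
    "This some more\n"
    "♥♥♥♥\n"
    "But this is not to be changed: ♥♥♥\n"
)
print(drop_separator_lines(text))
```

Because the match is anchored to a full line, hearts at the end of a sentence are left in place. The character class also accepts lines that mix the listed characters, such as _-_-, which may or may not be what you want.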

HTH

answered Oct 29 '22 23:10

mac
