
Match same number of repetitions of character as repetitions of captured group

I would like to clean up keyboard input that was logged, using Python and regex, in particular where backspace was used to fix a mistake.

Example 1:

[in]:  'Helloo<BckSp> world'
[out]: 'Hello world'

This can be done with

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
However, when there are several consecutive backspaces, I don't know how to delete exactly the same number of characters before them:

[in]:  'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'

(Here I want to remove 'l' and 'o' before the two backspaces).

I could simply apply re.sub(r'[^>]<BckSp>', '', line) repeatedly until there is no <BckSp> left, but I would like to find a more elegant / faster solution.
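For reference, the repeated-substitution approach described above can be sketched as follows. This is a minimal sketch rather than a tuned implementation: the helper name apply_backspaces is made up for illustration, and re.subn is used so that the loop stops as soon as a pass removes nothing.

```python
import re

# One typed character (anything but '>') followed by a single backspace marker.
# Excluding '>' keeps the match from swallowing the end of a previous <BckSp>.
BACKSPACE = re.compile(r'[^>]<BckSp>')

def apply_backspaces(line):
    """Hypothetical helper: delete character/backspace pairs until none remain."""
    while True:
        line, count = BACKSPACE.subn('', line)
        if count == 0:  # nothing was removed in this pass, so we are done
            return line

print(apply_backspaces('Helloo<BckSp> world'))          # Hello world
print(apply_backspaces('Helllo<BckSp><BckSp>o world'))  # Hello world
```

Each pass shortens the string, so the loop always terminates. A <BckSp> with nothing in front of it, or one preceded by a literal '>', is left in place, which is a limitation of the pattern itself.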

Does anyone know how to do this?

Louis M asked Dec 27 '16 10:12




2 Answers

It looks like Python's built-in re module does not support recursive patterns. If you can use another language or regex engine (PCRE, for example), you could try this:

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

Fallenhero answered Oct 07 '22 21:10


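It isn't very efficient, but it can be done in a single pattern that uses only lookaheads and backreferences. (Caveat: CPython's built-in re module refuses a backreference to a group that is still open, as \1 is here, with the error "cannot refer to an open group", so the pattern needs an engine that accepts such references, PCRE for instance.)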

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1


This way you don't have to count; the pattern relies only on repetition.

(?: 
    [^<] # a character to remove
    (?=  # lookahead to reach the corresponding <BckSp>
        [^<]* # skip characters until the first <BckSp>
        (  # capture group 1: contains the <BckSp>s
            (?=(\1?))\2 # emulate an atomic group in place of \1?+
                        # The idea is to re-match the <BckSp>s already matched in
                        # previous repetitions, if any, to be sure that the following
                        # <BckSp> isn't already associated with a character
            <BckSp> # corresponding <BckSp>
        )
    )
)+ # each time the group is repeated, capture group 1 grows by one <BckSp>

\1 # matches all the consecutive <BckSp>s and ensures that no other character remains
   # between the last character to remove and the first <BckSp>
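The atomic-group emulation used in the breakdown, (?=(...)) followed by a backreference, may be easier to see in isolation. The sketch below is a toy illustration with made-up inputs, not the backspace pattern itself. It relies on the fact that once a lookahead has succeeded, the engine does not backtrack into it.

```python
import re

# Plain greedy a+ can give one 'a' back so that the trailing 'ab' still fits.
greedy = re.search(r'a+ab', 'aaab')

# (?=(a+))\1 behaves like an atomic group: the lookahead captures the longest
# run of 'a', \1 then consumes exactly that text, and the engine cannot go
# back into the lookahead to shorten it, so the trailing 'ab' never matches.
atomic = re.search(r'(?=(a+))\1ab', 'aaab')

print(greedy.group())  # aaab
print(atomic)          # None
```

In the backspace pattern the same trick keeps \1? from giving up the <BckSp>s that were already paired with earlier characters.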

You can do the same with the third-party regex module, but this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1


With the regex module, you can also use recursion (as @Fallenhero noted):

[^<](?R)?<BckSp>
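A minimal sketch of how the recursive pattern might be applied from Python, assuming the third-party regex package is installed (pip install regex). The helper name strip_backspaces is hypothetical, and the import is guarded only so that the snippet still loads where the package is missing.

```python
try:
    import regex  # third-party package, not the standard-library re module
except ImportError:
    regex = None

def strip_backspaces(line):
    """Hypothetical helper: remove each character together with its <BckSp>."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    # [^<]     one character to delete
    # (?R)?    optionally, a nested character/<BckSp> pair (whole-pattern recursion)
    # <BckSp>  the backspace that deletes the character matched by [^<]
    return regex.sub(r'[^<](?R)?<BckSp>', '', line)

if regex is not None:
    print(strip_backspaces('Helllo<BckSp><BckSp>o world'))
```

The recursion nests the pairs like balanced brackets, so n characters are always matched with exactly n backspaces in a single pass.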


Casimir et Hippolyte answered Oct 07 '22 21:10