I would like to clean up some input that was logged from my keyboard, using Python and regex, especially where backspace was used to fix a mistake.
Example 1:
[in]: 'Helloo<BckSp> world'
[out]: 'Hello world'
This can be done with
re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')
Example 2:
However, when I have several backspaces, I don't know how to delete exactly the same number of characters before them:
[in]: 'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'
(Here I want to remove 'l' and 'o' before the two backspaces).
I could simply apply re.sub(r'[^>]<BckSp>', '', line) several times until there is no <BckSp> left, but I would like to find a more elegant / faster solution.
Does anyone know how to do this?
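For reference, the repeated-substitution fallback mentioned in the question can be sketched as follows. This is only a minimal sketch: the function name strip_backspaces_loop is made up for illustration, and re.subn is used so the loop can stop as soon as a pass replaces nothing.

```python
import re

# Same pattern as in the question: one character that is not '>'
# immediately followed by a <BckSp> marker.
PAIR = re.compile(r"[^>]<BckSp>")


def strip_backspaces_loop(line: str) -> str:
    """Apply the substitution repeatedly until a pass changes nothing."""
    while True:
        # subn returns the new string and the number of replacements made.
        line, count = PAIR.subn("", line)
        if count == 0:
            return line


print(strip_backspaces_loop("Helllo<BckSp><BckSp>o world"))  # Hello world
```

Each pass removes one character per run of consecutive markers, so a run of k markers needs k passes plus a final pass that finds nothing. Because of [^>], a typed '>' can never be deleted, and a marker with nothing before it is left in the output.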
You can repeat the contents of a group with a repetition quantifier such as *, +, ? or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".
Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.
A repeat is an expression that is repeated an arbitrary number of times. An expression followed by '*' can be repeated any number of times, including zero. An expression followed by '+' can be repeated any number of times, but at least once.
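As a quick illustration of repeating a group and capturing it, here is a small sketch using Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# (ab)* repeats the whole group: three repetitions here.
star = re.fullmatch(r"(ab)*", "ababab")
# Zero repetitions are also allowed by '*', so the empty string matches.
empty = re.fullmatch(r"(ab)*", "")
# '+' requires at least one repetition, so this is None.
plus = re.fullmatch(r"(ab)+", "")

# A repeated capturing group keeps only the text of its last repetition.
last = re.fullmatch(r"(a|b)*", "abab").group(1)

# Parentheses capture the text they matched for further processing.
m = re.search(r"(\w+)<BckSp>", "Helloo<BckSp> world")
captured = m.group(1)

print(star is not None, empty is not None, plus, last, captured)
```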
It looks like Python's built-in re module does not support recursive patterns. If you can use another language, you could try this:
.(?R)?<BckSp>
See: https://regex101.com/r/OirPNn/1
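If recursion is not available, the nested pairing that the recursive pattern targets (each <BckSp> cancels the nearest surviving character before it) can also be done in one pass with a stack. Note that this is a swapped-in technique, not the regex from this answer, and the name strip_backspaces_stack is made up for illustration. It is a sketch that gives the expected result on the two examples from the question.

```python
import re

TOKEN = "<BckSp>"  # marker used in the question's log format


def strip_backspaces_stack(text: str, token: str = TOKEN) -> str:
    """Single pass: push typed characters, pop one for each backspace marker."""
    out = []  # characters that have survived so far
    # Splitting on a capturing group keeps the markers in the result list.
    for piece in re.split(f"({re.escape(token)})", text):
        if piece == token:
            if out:  # assumption: a backspace with nothing to delete is dropped
                out.pop()
        else:
            out.extend(piece)  # push the characters of a non-marker chunk
    return "".join(out)


print(strip_backspaces_stack("Helllo<BckSp><BckSp>o world"))  # Hello world
```

Unlike the regex approaches, this drops a marker that has no character to delete instead of leaving it in the text. Whether that is the right behaviour depends on how the log was produced.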
It isn't very efficient but you can do that with the re module:
(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1
This way you don't have to count; the pattern relies only on repetition.
(?:
    [^<]                  # a character to remove
    (?=                   # lookahead to reach the corresponding <BckSp>
        [^<]*             # skip characters until the first <BckSp>
        (                 # capture group 1: contains the <BckSp>s
            (?=(\1?))\2   # emulate an atomic group in place of \1?+
                          # The idea is to add the <BckSp>s already matched in the
                          # previous repetitions, if any, to be sure that the following
                          # <BckSp> isn't already associated with a character
            <BckSp>       # corresponding <BckSp>
        )
    )
)+                        # each time the group is repeated, capture group 1 grows by a new <BckSp>
\1                        # matches all the consecutive <BckSp>s and ensures that there are no more
                          # characters between the last character to remove and the first <BckSp>
You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:
(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1
With the regex module, you can also use recursion (as @Fallenhero noticed):
[^<](?R)?<BckSp>