How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1}
does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4
and the string xxaaaayyybbbbbzzccccxx
, the expressions should match aaaa
and cccc
and not bbbb
.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.
Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1)
except that Python's re module is broken (see this link).
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1)..
instead of \1.
(these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With