From my understanding,
(.)(?<!\1)
should never match. Actually, php's preg_replace
even refuses to compile this and so does ruby's gsub
. The python re
module seems to have a different opinion though:
import re
test = 'xAAAAAyBBBBz'
print (re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
Result:
(x)AAAA(A)(y)BBB(B)(z)
Can anyone provide a reasonable explanation for this behavior?
This behavior appears to be a limitation in the re
module. The alternative regex
module seems to handle groups in assertions correctly:
import regex
test = 'xAAAAAyBBBBz'
print (regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz
print (regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)
Note that unlike pcre
, regex
also allows variable-width lookbehinds:
print (regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)
Eventually, regex
is going to be included in the standard library, as mentioned in PEP 411.
The good news is that you can use lookbehind anywhere in the regex, not only at the start.
In this type of lookahead the regex engine searches for a particular element which may be a character or characters or a group after the item matched. If that particular element is not present then the regex declares the match as a match otherwise it simply rejects that match.
A negative lookbehind assertion asserts true if the pattern inside the lookbehind is not matched.
This does look like a limitation (nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re
module.
I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it's not clever enough to figure out that \1
will always be fixed-length. Why it doesn't complain about this when compiling the regex, I can't say.
Funnily enough:
>>> print (re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>
So better don't use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here as if it was a positive lookahead):
>>> print (re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz
And I can't even guess what's going on here:
>>> print (re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With