Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Impossible lookbehind with a backreference

From my understanding,

(.)(?<!\1)

should never match. Actually, php's preg_replace even refuses to compile this and so does ruby's gsub. The python re module seems to have a different opinion though:

import re
test = 'xAAAAAyBBBBz'
print (re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))

Result:

(x)AAAA(A)(y)BBB(B)(z)

Can anyone provide a reasonable explanation for this behavior?

Update

This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:

import regex

test = 'xAAAAAyBBBBz'

print (regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz

print (regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)

Note that unlike pcre, regex also allows variable-width lookbehinds:

print (regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)

Eventually, regex is going to be included in the standard library, as mentioned in PEP 411.

like image 632
georg Avatar asked Apr 23 '12 10:04

georg


People also ask

Can I use regex Lookbehind?

The good news is that you can use lookbehind anywhere in the regex, not only at the start.

What is a negative lookahead?

In this type of lookahead the regex engine searches for a particular element which may be a character or characters or a group after the item matched. If that particular element is not present then the regex declares the match as a match otherwise it simply rejects that match.

What is negative look behind?

A negative lookbehind assertion asserts true if the pattern inside the lookbehind is not matched.


1 Answers

This does look like a limitation (nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re module.

I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it's not clever enough to figure out that \1 will always be fixed-length. Why it doesn't complain about this when compiling the regex, I can't say.

Funnily enough:

>>> print (re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>

So better don't use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here as if it was a positive lookahead):

>>> print (re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz

And I can't even guess what's going on here:

>>> print (re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz
like image 104
Tim Pietzcker Avatar answered Sep 29 '22 08:09

Tim Pietzcker