handling of zero length matches has changed with python 3.7. Consider the following with python 3.6 (and previous):
>>> import re
>>> print(re.sub('a*', 'x', 'bac'))
xbxcx
>>> print(re.sub('.*', 'x', 'bac'))
x
We get the following with python 3.7:
>>> import re
>>> print(re.sub('a*', 'x', 'bac'))
xbxxcx
>>> print(re.sub('.*', 'x', 'bac'))
xx
I understand this is the standard behavior of PCRE. Furthermore, re.finditer() seems to have always detected the additional match:
>>> for m in re.finditer('a*', 'bac'):
... print(m.start(0), m.end(0), m.group(0))
...
0 0
1 2 a
2 2
3 3
That said, I'm interested in retrieving the behavior of python 3.6 (this is for a hobby project implementing sed in python).
I can come with the following solution:
def sub36(regex, replacement, string):
compiled = re.compile(regex)
class Match(object):
def __init__(self):
self.prevmatch = None
def __call__(self, match):
try:
if match.group(0) == '' and self.prevmatch and match.start(0) == self.prevmatch.end(0):
return ''
else:
return re._expand(compiled, match, replacement)
finally:
self.prevmatch = match
return compiled.sub(Match(), string)
which gives:
>>> print(re.sub('a*', 'x', 'bac'))
xbxxcx
>>> print(sub36('a*', 'x', 'bac'))
xbxcx
>>> print(re.sub('.*', 'x', 'bac'))
xx
>>> print(sub36('.*', 'x', 'bac'))
x
However, this seems very crafted for these examples.
What would be the right way to implement python 3.6 behavior for re.sub() zero length matches with python 3.7?
Your solution may be in the regex egg:
Regex Egg Introduction
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality. The re module’s behaviour with zero-width matches changed in Python 3.7, and this module will follow that behaviour when compiled for Python 3.7.
Installation:
pip install regex
Usage:
With regex
, you can specify the version (V0
, V1
) which regex pattern will be compiled with, i.e.:
# Python 3.7 and later
import regex
>>> regex.sub('.*', 'x', 'test')
'xx'
>>> regex.sub('.*?', '|', 'test')
'|||||||||'
# Python 3.6 and earlier
import regex
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'
Note:
Version can be indicated by
VERSION0
orV0
flag, or(?V0)
in the pattern.
Sources:
Regex thread - issue2636
regex 2018.11.22
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With