
Reproducing Python 3.6 handling of re.sub() with zero-length matches in Python 3.7

The handling of zero-length matches changed in Python 3.7. Consider the following with Python 3.6 (and earlier):

>>> import re
>>> print(re.sub('a*', 'x', 'bac'))
xbxcx
>>> print(re.sub('.*', 'x', 'bac'))
x

With Python 3.7 we get the following instead:

>>> import re
>>> print(re.sub('a*', 'x', 'bac'))
xbxxcx
>>> print(re.sub('.*', 'x', 'bac'))
xx

I understand this is the standard behavior of PCRE. Furthermore, re.finditer() seems to have always detected the additional match:

>>> for m in re.finditer('a*', 'bac'):
...     print(m.start(0), m.end(0), m.group(0))
...
0 0
1 2 a
2 2
3 3

That said, I'm interested in reproducing the Python 3.6 behavior (this is for a hobby project implementing sed in Python).

I came up with the following solution:

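import re

def sub36(regex, replacement, string):
    compiled = re.compile(regex)

    class Match(object):
        def __init__(self):
            # Most recent match seen by the callback, whether replaced or not.
            self.prevmatch = None

        def __call__(self, match):
            try:
                # Drop an empty match that starts where the previous match ended.
                if (match.group(0) == '' and self.prevmatch is not None
                        and match.start(0) == self.prevmatch.end(0)):
                    return ''
                # match.expand() is the public counterpart of the private re._expand().
                return match.expand(replacement)
            finally:
                self.prevmatch = match

    return compiled.sub(Match(), string)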

which gives:

>>> print(re.sub('a*', 'x', 'bac'))
xbxxcx
>>> print(sub36('a*', 'x', 'bac'))
xbxcx
>>> print(re.sub('.*', 'x', 'bac'))
xx
>>> print(sub36('.*', 'x', 'bac'))
x

However, this seems tailored to these two examples, and I am not sure it holds for other patterns.

What would be the right way to reproduce the Python 3.6 behavior of re.sub() for zero-length matches when running Python 3.7?

Gilles Arcas asked Dec 05 '18 23:12



1 Answer

Your solution may be in the regex egg:

Regex Egg Introduction

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality. The re module’s behaviour with zero-width matches changed in Python 3.7, and this module will follow that behaviour when compiled for Python 3.7.


Installation:

pip install regex

Usage:

With regex, you can specify which version (V0 or V1) a pattern is compiled with, e.g.:

# Running on Python 3.7 and later
>>> import regex
>>> regex.sub('.*', 'x', 'test')
'xx'
>>> regex.sub('.*?', '|', 'test')
'|||||||||'

# Running on Python 3.6 and earlier
>>> import regex
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'

Note:

The version can be selected with the VERSION0 (or V0) flag, with the VERSION1 (or V1) flag, or inline with (?V0) or (?V1) at the start of the pattern.
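
The introduction quoted above says the module follows the new behaviour when compiled for Python 3.7, so it may not give you the old results on a 3.7 interpreter. In that case, a standard-library fallback is to drive the substitution yourself with Pattern.search(). The sketch below rests on an assumption about the old rules: an empty match sitting right at the end of the previous replacement is skipped, and the scan steps one character past any empty match. The name sub36_search is hypothetical, and I have only checked it against the outputs shown in this thread, not against every pattern.

```python
import re

def sub36_search(pattern, repl, string, count=0):
    """Sketch of pre-3.7 re.sub() semantics built on Pattern.search()."""
    compiled = re.compile(pattern)
    pieces = []
    last_end = 0   # end of the last match that was actually replaced
    pos = 0        # where the next search starts
    done = 0       # number of replacements made so far
    # The explicit bound on pos stops the scan once it has passed the end.
    while pos <= len(string) and (count == 0 or done < count):
        match = compiled.search(string, pos)
        if match is None:
            break
        start, end = match.span()
        # Assumed old rule: skip an empty match located exactly where
        # the previous replacement ended.
        if not (start == end == last_end and done > 0):
            pieces.append(string[last_end:start])
            pieces.append(repl(match) if callable(repl) else match.expand(repl))
            last_end = end
            done += 1
        # Assumed old rule: step one character past an empty match.
        pos = end + 1 if start == end else end
    pieces.append(string[last_end:])
    return ''.join(pieces)
```

On the inputs from the question and from the regex examples above, this gives:

>>> sub36_search('a*', 'x', 'bac')
'xbxcx'
>>> sub36_search('.*', 'x', 'bac')
'x'
>>> sub36_search('.*?', '|', 'test')
'|t|e|s|t|'

Because it calls search() itself rather than filtering the matches that the 3.7 engine hands to a callback, it does not depend on which empty matches the newer engine chooses to report.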


Sources:

Regex thread - issue2636
regex 2018.11.22

Pedro Lobito answered Sep 29 '22 13:09
