
Variable-length replacement with `re.sub()`

Tags: python, regex

I would like to replace every run of 3 or more "=" characters with an equal number of "-" characters.

def f(a, b):
    '''
    Example
    =======
    >>> from x import y
    '''
    return a == b

becomes

def f(a, b):
    '''
    Example
    -------
    >>> from x import y
    '''
    return a == b        # don't touch

My working but hacky solution is to pass a lambda as the repl argument of re.sub(), which computes the length of each match:

>>> import re

>>> s = """
... def f(a, b):
...     '''
...     Example
...     =======
...     >>> from x import y
...     '''
...     return a == b"""

>>> eq = r'(={3,})'
>>> print(re.sub(eq, lambda x: '-' * (x.end() - x.start()), s))

def f(a, b):
    '''
    Example
    -------
    >>> from x import y
    '''
    return a == b

Can I do this without needing to pass a function to re.sub()?

My thinking would be that I'd need r'(=){3,}' (a variable-length capturing group), but re.sub(r'(=){3,}', '-', s) has a problem with greediness, I believe.

Can I modify the regex eq above so that the lambda isn't needed?

Brad Solomon asked Mar 24 '18 19:03


5 Answers

With some help from lookahead and lookbehind, it is possible to replace character by character:

>>> re.sub("(=(?===)|(?<===)=|(?<==)=(?==))", "-", "=== == ======= asdlkfj")
'--- == ------- asdlkfj'
Marat answered Nov 11 '22 12:11


This re.sub call relies on some deceptive lookahead trickery and works only if the run to replace is always followed by a newline '\n'. Note that a run of just one or two "=" directly before a newline would be replaced as well.

print(re.sub(r'=(?=={2}|=?\n)', '-', s))
def f(a, b):
    '''
    Example
    -------
    >>> from x import y
    '''
    return a == b

Details
"Replace an equal sign if it is followed by two equal signs, or by an optional equal sign and a newline."

=        # equal sign if
(?=={2}  # lookahead
|        # regex OR
=?       # optional equal sign
\n       # newline
)
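
To make the newline assumption concrete, here is a minimal sketch of the same pattern written with re.VERBOSE, run on a few edge cases. The names PATTERN and underline_to_dashes are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is commented.
PATTERN = re.compile(r"""
    =            # an equal sign, replaced only if ...
    (?=          # ... the text right after it is
        ={2}     #     two more equal signs
      |          #   or
        =?\n     #     an optional equal sign and then a newline
    )
""", re.VERBOSE)

def underline_to_dashes(text):
    # Hypothetical helper name, used only in this sketch.
    return PATTERN.sub("-", text)

# A run followed by a newline is replaced completely.
print(repr(underline_to_dashes("Example\n=======\n")))
# "==" in the middle of a line is left alone.
print(repr(underline_to_dashes("return a == b\n")))
# Caveat 1: a short run right before a newline is replaced too.
print(repr(underline_to_dashes("x ==\n")))
# Caveat 2: without a trailing newline the last two "=" are kept.
print(repr(underline_to_dashes("=======")))
```

The last two calls show where the assumption matters: the pattern is only safe when every run you want to replace ends a line and no shorter run does.
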
cs95 answered Nov 11 '22 10:11


It's possible, but not advisable.

The way re.sub works is that it finds a complete match and then replaces it. It doesn't replace each capture group separately, so things like re.sub(r'(=){3,}', '-', s) won't work - that'll replace the entire match with a single dash, not each occurrence of the = character.

>>> re.sub(r'(=){3,}', '-', '=== ===')
'- -'

So if you want to avoid a lambda, you have to write a regex that matches individual = characters - but only if there are at least 3 of them in a row. This is, of course, much more difficult than matching 3 or more = characters with the simple pattern ={3,}. It requires some use of lookarounds and looks like this:

(?<===)=|(?<==)=(?==)|=(?===)

This does what you want:

>>> re.sub(r'(?<===)=|(?<==)=(?==)|=(?===)', '-', '= == === ======')
'= == --- ------'

But it's clearly much less readable than the original lambda solution.
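
If you want some reassurance that the two approaches agree, a quick check along these lines may help. This is only a sketch: the function names and the sample strings are chosen for illustration, and a handful of samples is not a proof of equivalence.

```python
import re

# Per-character pattern from this answer: each "=" is matched on its own,
# but only when it sits inside a run of at least three.
LOOKAROUND = re.compile(r'(?<===)=|(?<==)=(?==)|=(?===)')
# Whole-run pattern used by the lambda approach from the question.
RUN = re.compile(r'={3,}')

def with_lookarounds(text):
    return LOOKAROUND.sub('-', text)

def with_lambda(text):
    return RUN.sub(lambda m: '-' * len(m.group()), text)

# Illustrative samples: short runs, long runs, adjacent runs, empty input.
samples = ['= == === ======', 'a == b', '====', '===x===', '', 'x = "==="']
for sample in samples:
    assert with_lookarounds(sample) == with_lambda(sample), sample

print(with_lookarounds('= == === ======'))
```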

Aran-Fey answered Nov 11 '22 10:11


Using the third-party regex module (not the standard library's re, which does not support \G), you can write:

regex.sub(r'\G(?!\A)=|=(?===)', '-', s)
  • \G is the position immediately after the last successful match or the start of the string.
  • (?!\A) forces the start of the string to fail.

The second branch =(?===) succeeds when a = is followed by two other =. Then the next matches use the first branch \G(?!\A)= until there are no more consecutive =.
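
A minimal sketch of how this could be tried out, assuming the regex package is installed (for example from PyPI). The guarded import and the function name dash_runs are additions for this illustration only.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # the sketch below only runs when the package is available

def dash_runs(text):
    # Second branch starts a run: "=" followed by two more "=".
    # First branch continues it: "=" exactly where the previous match ended,
    # excluding the very start of the string.
    return regex.sub(r'\G(?!\A)=|=(?===)', '-', text)

if regex is not None:
    print(dash_runs('= == === ======'))
```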

Casimir et Hippolyte answered Nov 11 '22 12:11


The question explicitly asks for a solution that doesn't use a function, but for completeness, and for anyone looking for a clearer solution that doesn't involve lots of regex tricks, it's possible to use a function, as in the question "Replacing a RegEx with a string of characters with the same length":

re.sub('={3,}', lambda x: '-' * len(x.group()), s)
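
If you need this in more than one place, the one-liner could be wrapped in a small helper. The following is just a sketch: replace_runs and its parameters old, new and min_len are hypothetical names introduced here, not an existing API.

```python
import re

def replace_runs(text, old="=", new="-", min_len=3):
    """Replace each run of at least min_len copies of old with as many copies of new."""
    # re.escape lets old be a character that is special in regexes, such as "*".
    pattern = re.compile("(?:%s){%d,}" % (re.escape(old), min_len))
    # The match length divided by len(old) is the number of repetitions in the run.
    return pattern.sub(lambda m: new * (len(m.group()) // len(old)), text)

print(replace_runs("= == === ======"))
print(replace_runs("a***b", old="*", new="~"))
```

Building the pattern from the arguments keeps the readable ={3,} idea while making the character and the minimum run length configurable.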

cookiemonster answered Nov 11 '22 10:11