
Replacing named capturing groups with re.sub

I want to replace the text matched by a regular expression in a string, and I can do this with re.sub(). If I pass a function as the repl argument, it works as desired, as illustrated below:

from __future__ import print_function
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'

my_str = "Here's some <first>sample stuff</first> in the " \
            "<second>middle</second> of some other text."

def replace(m):
    return ''.join(map(lambda v: v if v else '',
                        map(m.group, ('text', 'content'))))

cleaned = re.sub(pattern, replace, my_str)
print('cleaned: {!r}'.format(cleaned))

Output:

cleaned: "Here's some sample stuff in the middle of some other text."

However, from the documentation it sounds like I should be able to get the same result by passing a replacement string that references the named groups. My attempt to do that didn't work, because sometimes a group is unmatched and the value returned for it is None (rather than an empty string '').

cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))

Output:

Traceback (most recent call last):
  File "test_resub.py", line 21, in <module>
    cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
  File "C:\Python\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Python\lib\re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python\lib\sre_parse.py", line 802, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

What am I doing wrong or not understanding?

martineau asked Dec 23 '14 21:12


2 Answers

import re

def repl(matchobj):
    # Group 3 ('content') is None when the match ended at '$' rather than a tag.
    if matchobj.group(3):
        return matchobj.group(1) + matchobj.group(3)
    else:
        return matchobj.group(1)

my_str = "Here's some <first>sample stuff</first> in the " \
        "<second>middle</second> of some other text."

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
print(re.sub(pattern, repl, my_str))

You can pass a function as the repl argument of re.sub.

Edit: cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str) will not work, because when the last bit of the string matches, i.e. " of some other text.", \g<text> is defined but \g<content> is not, since there is no content. You still ask re.sub to substitute it, so it raises the error. If you use the string "Here's some <first>sample stuff</first> in the <second>middle</second>", then print(re.sub(pattern, r"\g<text>\g<content>", my_str)) will work, because \g<content> is defined in every match.
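A minimal sketch of the point above, not part of the original answer: listing the named groups of each match shows that the trailing text matches through the $ branch, so content is None there. The second half assumes it is acceptable to match only the tagged spans; the narrower pattern (called tag_only here purely for illustration) makes content participate in every match, so a template string is enough.

```python
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Collect the named groups of every match of the original pattern.
groups = [m.groupdict() for m in re.finditer(pattern, my_str)]
for g in groups:
    print(g)  # the match covering the trailing text has content=None

# Illustrative alternative (an assumption, not the asker's pattern):
# match only the tagged spans, so 'content' is set in every match.
tag_only = r'<(?P<tag>\w+)>(?P<content>.*?)</(?P=tag)>'
cleaned = re.sub(tag_only, r'\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))
```

According to the re.sub documentation, since Python 3.5 unmatched groups in a replacement template are replaced with an empty string, so the traceback in the question is specific to earlier versions.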

vks answered Sep 29 '22 13:09


If I understand correctly, you want to remove everything between < > inclusive:

>>> import re
>>> my_str = "Here's some <first>sample stuff</first> in the <second>middle</second> of some other text."
>>> print(re.sub(r'<.*?>', '', my_str))
Here's some sample stuff in the middle of some other text.

To explain what's going on in r'<.*?>':

< matches a literal <

. matches any character (except a newline)

* repeats the preceding . any number of times, including zero

? makes the repetition non-greedy, so the match is the shortest possible; without it, the match would run to the last > instead of the first available one

> matches the closing >

Then everything from < through >, inclusive, is replaced with nothing.
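To make the effect of the ? concrete, here is a small sketch comparing the non-greedy pattern with its greedy counterpart on the same string. The variable names lazy and greedy are just illustrative.

```python
import re

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Non-greedy: each match stops at the first '>' it can reach,
# so only the four tags themselves are removed.
lazy = re.sub(r'<.*?>', '', my_str)

# Greedy: a single match runs from the first '<' to the last '>',
# so the text between the tags is removed as well.
greedy = re.sub(r'<.*>', '', my_str)

print(lazy)
print(greedy)
```

The greedy version drops "sample stuff", " in the " and "middle" along with the tags, which is why the ? matters here.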

MrAlexBailey answered Sep 29 '22 13:09