I want to replace the text of matched re patterns in a string, and I can do this using re.sub(). If I pass it a function as the repl argument in the call, it works as desired, as illustrated below:
from __future__ import print_function
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
my_str = "Here's some <first>sample stuff</first> in the " \
         "<second>middle</second> of some other text."

def replace(m):
    return ''.join(map(lambda v: v if v else '',
                       map(m.group, ('text', 'content'))))

cleaned = re.sub(pattern, replace, my_str)
print('cleaned: {!r}'.format(cleaned))
Output:
cleaned: "Here's some sample stuff in the middle of some other text."
However, from the documentation it sounds like I should be able to get the same results by just passing it a replacement string with references to the named groups in it. My attempt to do that didn't work, because sometimes a group is unmatched and the value returned for it is None (rather than an empty string '').
cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))
Output:
Traceback (most recent call last):
  File "test_resub.py", line 21, in <module>
    cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
  File "C:\Python\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Python\lib\re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python\lib\sre_parse.py", line 802, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group
What am I doing wrong or not understanding?
import re

def repl(matchobj):
    # Group 3 is 'content'; it is None when only the trailing text matched.
    if matchobj.group(3):
        return matchobj.group(1) + matchobj.group(3)
    else:
        return matchobj.group(1)

my_str = "Here's some <first>sample stuff</first> in the " \
         "<second>middle</second> of some other text."
pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
print(re.sub(pattern, repl, my_str))
You can pass a function as the repl argument of re.sub, as shown above.
Edit:
cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
This will not work because, when the last part of the string matches (i.e. " of some other text."), \g<text> is defined but \g<content> is not, as there is no content. You still ask re.sub to substitute it, so it raises the error. If you use the string "Here's some <first>sample stuff</first> in the <second>middle</second>" instead, then print re.sub(pattern, r"\g<text>\g<content>", my_str) will work, as \g<content> is defined for every match.
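One way to keep a plain replacement string, sketched here under assumptions: if the pattern matches only the tagged spans, the content group takes part in every match and \g<content> is always defined. The name tag_pattern and the non-greedy .*? are illustrative choices, not part of the original question. As a side note, the Python documentation states that from version 3.5 onward unmatched groups in a replacement template are replaced with an empty string, so the original call should not raise there.

```python
import re

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Hypothetical variant of the pattern: it matches only the tagged spans,
# so the 'content' group participates in every match.
tag_pattern = r'<(?P<tag>\w+)>(?P<content>.*?)</(?P=tag)>'

# Text outside the tags is never matched, so re.sub leaves it untouched.
cleaned = re.sub(tag_pattern, r'\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))
```

Because the surrounding text is no longer captured, the text group is not needed in the replacement at all.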
If I understand correctly, you want to remove everything between < and >, inclusive:
>>> import re
>>> my_str = "Here's some <first>sample stuff</first> in the <second>middle</second> of some other text."
>>> print re.sub(r'<.*?>', '', my_str)
Here's some sample stuff in the middle of some other text.
To somewhat explain what's going on here, the parts of r'<.*?>' are:
<  finds the first <
.  then accepts any character
*  accepts any character any number of times
?  limits the result to the shortest possible match; without this, it would go until the last > instead of the first available one
>  finds the closing point >
Then, replace everything between those two points with nothing.
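A small sketch of the difference the ? makes, using the same sample string. The variable names lazy and greedy are only for illustration.

```python
import re

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Non-greedy: each match stops at the first '>' after a '<'.
lazy = re.sub(r'<.*?>', '', my_str)

# Greedy: a single match runs from the first '<' to the last '>'.
greedy = re.sub(r'<.*>', '', my_str)

print(repr(lazy))
print(repr(greedy))
```

The greedy version removes everything from <first> through </second>, including the text between the tags.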