
Python re.sub non-greedy substitution fails with a newline in the string [duplicate]

Tags: python, regex

I've struck a problem with a regular expression in Python (2.7.9).

I'm trying to strip out HTML <span> tags using a regex like so:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)

(The regex reads: <span, anything that's not a >, then a >, then a non-greedy match of anything, followed by </span>; re.S (re.DOTALL) is passed so that . matches newline characters.)

This seems to work unless there is a newline in the text. It looks like re.S (DOTALL) doesn't apply within a non-greedy match.

Here's the test code; remove the newline from text1 and the re.sub works. Put it back in, and the re.sub fails. Put the newline char outside the <span> tag, and the re.sub works.

#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)

For comparison, I wrote a Perl script to do the same thing; the regex works as I expect here.

#!/usr/bin/perl
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>';
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/\1/s;
print "$text1\n";

Any ideas?

Tested in Python 2.6.6 and Python 2.7.9

Andy Watkins asked Apr 18 '16 09:04




1 Answer

The 4th parameter of re.sub is count, not flags.

re.sub(pattern, repl, string, count=0, flags=0)

You need to use a keyword argument to specify the flags explicitly:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑
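
Two other ways to apply DOTALL also avoid the positional-argument mix-up. The following is a minimal sketch rather than a definitive recipe: the names span_re, via_compile and via_inline are hypothetical, and text1 is the test string from the question.

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

# Option 1: attach the flag when compiling. The compiled pattern's sub()
# has no flags parameter, so re.S cannot be mistaken for a count.
span_re = re.compile(r'<span[^>]*>(.*?)</span>', re.S)
via_compile = span_re.sub(r'\1', text1)

# Option 2: put the inline flag (?s) at the start of the pattern itself,
# which turns on DOTALL without any extra argument to re.sub.
via_inline = re.sub(r'(?s)<span[^>]*>(.*?)</span>', r'\1', text1)

print(repr(via_compile))
print(repr(via_inline))
```

Both calls should strip the span tags while keeping the newline inside the text.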

Otherwise, re.S is interpreted as the replacement count (at most 16 replacements) rather than as the S (DOTALL) flag, so the pattern runs without DOTALL and . never matches the newline:

>>> import re
>>> re.S
16

>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
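
To see the two effects separately, here is a small sketch that is not part of the original session. The variable names are hypothetical, and the 'aaaa' input is made up purely to illustrate count.

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
pattern = r'<span[^>]*>(.*?)</span>'

# re.S is an integer-valued flag, which is why it is silently accepted
# wherever an integer count is expected.
flag_value = int(re.S)

# Without DOTALL, "." stops at the newline, so the pattern finds nothing.
no_flag = re.search(pattern, text1)

# With DOTALL passed by keyword, the lazy group spans the newline.
with_flag = re.search(pattern, text1, flags=re.S)

# count only caps how many replacements are made; it never changes matching.
capped = re.sub(r'a', 'b', 'aaaa', count=2)

print(flag_value, no_flag, repr(with_flag.group(1)), capped)
```

The unchanged output in the first re.sub call above comes from the missing DOTALL flag; the count of 16 is harmless here because there is nothing to replace.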
falsetru answered Oct 22 '22 22:10