I have the following text (formatted exactly as shown below):
<td scope="row" align="left">
My Class: TEST DATA<br>
Test Section: <br>
MY SECTION<br>
MY SECTION 2<br>
</td>
I'm trying to get the text between "Test Section:" and the </td> that follows the MY SECTION lines.
I've tried several different RegEx patterns and I'm not getting anywhere.
If I do:
(?<=Test)(.*?)(?=<br)
Then I get the correct response of:
' Section: '
But, if I do
(?<=Test)(.*?)(?=</td>)
I get no results. The result should be "MY SECTION
MY SECTION 2
"
I've tried using RegEx Multiline as well with no results.
Any help would be appreciated.
If it matters I'm coding in Python 2.7.
If something is not clear, or you need more info, please let me know.
Use the re.S (or re.DOTALL) flag, or prepend (?s) to the regular expression, to make . match every character, including newline. Without the flag, . does not match newline.
(?s)(?<=Test)(.*?)(?=</td>)
Example:
>>> s = '''<td scope="row" align="left">
... My Class: TEST DATA<br>
... Test Section: <br>
... MY SECTION<br>
... MY SECTION 2<br>
... </td>'''
>>>
>>> import re
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s) # without flags
[]
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s, flags=re.S)
[' Section: <br>\n MY SECTION<br>\n MY SECTION 2<br>\n ']
>>> re.findall('(?s)(?<=Test)(.*?)(?=</td>)', s)
[' Section: <br>\n MY SECTION<br>\n MY SECTION 2<br>\n ']
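If only the section names are wanted (as in the question's expected result), one possible follow-up is to anchor on the full "Test Section:" label and split the captured block on <br>. This is a minimal sketch, not the only way to do it; the variable names match and sections are just illustrative, and it assumes the input looks exactly like the sample in the question.

```python
import re

s = '''<td scope="row" align="left">
My Class: TEST DATA<br>
Test Section: <br>
MY SECTION<br>
MY SECTION 2<br>
</td>'''

# re.S (DOTALL) lets '.' cross newlines. The capture starts right after
# the first <br> that follows "Test Section:" and stops before </td>.
match = re.search(r'Test Section:\s*<br>(.*?)</td>', s, flags=re.S)

# Split the captured block on <br> and drop the empty pieces.
sections = []
if match:
    sections = [part.strip() for part in match.group(1).split('<br>') if part.strip()]

print(sections)  # ['MY SECTION', 'MY SECTION 2']
```

For anything beyond a small, fixed snippet like this, an HTML parser is usually more robust than a regular expression.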
Get the matched group from index 1
Test Section:([\S\s]*)</td>
Note: change the last part of the pattern (</td>) to suit your needs.
Sample code:
import re
p = re.compile(ur'Test Section:([\S\s]*)</td>', re.MULTILINE)
test_str = u"..."
re.findall(p, test_str)
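A note on the flag in the sample code: re.MULTILINE only changes where ^ and $ match, so it is not what lets this pattern span lines. The [\S\s] character class matches a newline on its own. The sketch below illustrates this with a made-up input string (text is hypothetical, not the question's data).

```python
import re

# Hypothetical input: the wanted text spans several lines.
text = 'Test Section:\nline 1\nline 2</td>'

# '.' with re.MULTILINE still stops at newlines, so nothing matches.
dot_multiline = re.findall(r'Test Section:(.*)</td>', text, re.MULTILINE)

# [\S\s] matches any character, newline included, with no flag at all.
char_class = re.findall(r'Test Section:([\S\s]*)</td>', text)

# '.' with re.DOTALL behaves the same way as [\S\s].
dot_dotall = re.findall(r'Test Section:(.*)</td>', text, re.DOTALL)

print(dot_multiline)  # []
print(char_class)     # ['\nline 1\nline 2']
print(dot_dotall)     # ['\nline 1\nline 2']
```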
Pattern Explanation:
Test Section:    'Test Section:'
(                group and capture to \1:
  [\S\s]*          any character of: non-whitespace (all but \n, \r, \t, \f, and " "), whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
)                end of \1
</td>            '</td>'
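Because [\S\s]* is greedy, it runs to the last </td> in the input, which matters if the text contains more than one cell. A lazy quantifier (*?) stops at the first one instead. The input below is a made-up two-cell string used only to show the difference.

```python
import re

# Hypothetical input with two cells.
html = '<td>Test Section:A</td><td>other</td>'

# Greedy: captures up to the LAST </td>.
greedy = re.findall(r'Test Section:([\S\s]*)</td>', html)

# Lazy: captures up to the FIRST </td>.
lazy = re.findall(r'Test Section:([\S\s]*?)</td>', html)

print(greedy)  # ['A</td><td>other']
print(lazy)    # ['A']
```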