
RegEx Get string between two strings that has line breaks

I have the following text (formatted just like below):

<td scope="row" align="left">
      My Class: TEST DATA<br>
      Test Section: <br>
      MY SECTION<br>
      MY SECTION 2<br>
    </td>

I'm attempting to get the text between "Test Section:" and the </td> that follows the MY SECTION lines.

I've made several attempts with different RegEx patterns, and I'm not getting anywhere.

If I do:

(?<=Test)(.*?)(?=<br)

Then I get the correct response of:

' Section: '

But, if I do

(?<=Test)(.*?)(?=</td>)

I get no results. The results should be "MY SECTION
MY SECTION 2
"

I've tried using RegEx Multiline as well with no results.

Any help would be appreciated.

If it matters I'm coding in Python 2.7.

If something is not clear, or you need more info, please let me know.

CodeLikeBeaker asked Jul 21 '14 14:07


2 Answers

Use the re.S (re.DOTALL) flag, or prepend (?s) to the regular expression, to make . match any character, including newline.

Without the flags, . does not match newline.

(?s)(?<=Test)(.*?)(?=</td>)

Example:

>>> s = '''<td scope="row" align="left">
...       My Class: TEST DATA<br>
...       Test Section: <br>
...       MY SECTION<br>
...       MY SECTION 2<br>
...     </td>'''
>>>
>>> import re
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s)  # without flags
[]
>>> re.findall('(?<=Test)(.*?)(?=</td>)', s, flags=re.S)
[' Section: <br>\n      MY SECTION<br>\n      MY SECTION 2<br>\n    ']
>>> re.findall('(?s)(?<=Test)(.*?)(?=</td>)', s)
[' Section: <br>\n      MY SECTION<br>\n      MY SECTION 2<br>\n    ']
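The match above still contains the rest of the "Test Section:" label, the <br> tags and the indentation. As a minimal sketch of getting from there to just the section names the question asks for, the following uses a hypothetical helper named section_lines (not part of the original answer). It assumes the entries are separated by <br> tags and that the block ends at the first </td>.

```python
import re

html = '''<td scope="row" align="left">
      My Class: TEST DATA<br>
      Test Section: <br>
      MY SECTION<br>
      MY SECTION 2<br>
    </td>'''


def section_lines(text):
    """Return the non-empty entries between 'Test Section:' and '</td>'.

    Hypothetical helper for illustration only.
    """
    # re.S (DOTALL) lets '.' cross newlines; '.*?' stops at the first </td>.
    match = re.search(r'Test Section:(.*?)</td>', text, flags=re.S)
    if match is None:
        return []
    # Split on <br> tags, then drop pieces that are only whitespace.
    pieces = re.split(r'<br\s*/?>', match.group(1))
    return [p.strip() for p in pieces if p.strip()]


sections = section_lines(html)
print(sections)  # ['MY SECTION', 'MY SECTION 2']
```

For anything beyond a small, predictable snippet like this one, an HTML parser is usually a more robust choice than a regular expression.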
falsetru answered Sep 27 '22 21:09


Get the matched group from index 1

Test Section:([\S\s]*)</td>

Note: change the trailing </td> delimiter to suit your needs.

sample code:

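import re
# re.MULTILINE only changes how ^ and $ behave, so it is not needed here;
# [\S\s] already matches newlines. The ur'' prefix is Python 2 only.
p = re.compile(ur'Test Section:([\S\s]*)</td>')
test_str = u"..."

re.findall(p, test_str)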

Pattern Explanation:

  Test Section:            'Test Section:'
  (                        group and capture to \1:
    [\S\s]*                  any character of: non-whitespace (all
                             but \n, \r, \t, \f, and " "), whitespace
                             (\n, \r, \t, \f, and " ") (0 or more
                             times (matching the most amount
                             possible))
  )                        end of \1
  </td>                    '</td>'
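One thing to watch with the greedy [\S\s]* is input that contains more than one </td>. The sketch below uses a made-up two-cell string (the variable names and sample data are illustrative, not from the question) to compare the greedy quantifier with the lazy [\S\s]*? form.

```python
import re

# Illustrative input with two cells; not taken from the question.
two_cells = '''<td>
      Test Section: <br>
      MY SECTION<br>
    </td>
<td>
      Other cell<br>
    </td>'''

greedy = re.findall(r'Test Section:([\S\s]*)</td>', two_cells)
lazy = re.findall(r'Test Section:([\S\s]*?)</td>', two_cells)

# The greedy form runs to the LAST </td>, so it swallows the second cell.
print('Other cell' in greedy[0])  # True
# The lazy form stops at the FIRST </td> after "Test Section:".
print('Other cell' in lazy[0])    # False
```

With a single </td> in the input, as in the question, both forms capture the same text.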
Braj answered Sep 27 '22 21:09