
Matching multiple lines in a Python regular expression

Tags:

python

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.

Sreejith Sasidharan asked Feb 04 '10 12:02




1 Answer

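Just to clear up the issue: despite all those links to re.M, it won't work here, as even a quick skim of its documentation would reveal. re.M only changes where ^ and $ match; it does not let . match a newline. You'd need re.S (also spelled re.DOTALL), leaving aside that regular expressions are a poor tool for parsing HTML, of course: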

>>> import re
>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
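
For anyone who wants to run the comparison as a script rather than in the interpreter, here is a minimal sketch of the same idea. The shortened `doc` string and the variable names are made up for illustration. It also shows two alternatives that should behave like re.S here: the inline `(?s)` flag and the `[\s\S]` character class.

```python
import re

# Shortened, hypothetical sample document (not the asker's real page).
doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
    </tr>
</table>"""

pattern = '<tr>(.*?)</tr>'

# re.M (MULTILINE) only changes where ^ and $ match; '.' still stops at '\n'.
rows_multiline = re.findall(pattern, doc, re.M)

# re.S (DOTALL) lets '.' match '\n', so the lazy group can span lines.
rows_dotall = re.findall(pattern, doc, re.S)

# The inline flag (?s) at the start of the pattern is equivalent to re.S.
rows_inline = re.findall('(?s)' + pattern, doc)

# [\s\S] matches any character, newline included, without needing a flag.
rows_charclass = re.findall(r'<tr>([\s\S]*?)</tr>', doc)

# Strip the surrounding whitespace from each captured row for readability.
cells = [row.strip() for row in rows_dotall]
print(cells)
```

All of these still assume a bare `<tr>` with no attributes and no nested tables. For real pages an HTML parser is the safer choice.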
SilentGhost answered Sep 22 '22 01:09