
lxml - ignore <br> tag in html

I wrote a tiny html-parser in Python using lxml. It's very useful, but I have a problem.

I have the following code:

tags = doc.xpath('//table//tr/td[@align="right"]/b')
for x in tags:
    print(x.text.strip())

It works fine. But if there is a <br> tag inside a <b> element, like this:

<b> first-half <br>
    second-half </b>

this code prints only first-half, the text that comes before the <br> tag.

How can I get all of the text inside <b>, even if it contains a <br> tag?

Thanks.

shau-kote asked Feb 28 '13 21:02





1 Answer

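Use text_content() to extract all of the non-markup text within a tag. x.text holds only the text up to the first child element; the text after the <br> is stored in that element's tail. Replace x.text with x.text_content().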
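A minimal sketch of the difference, assuming the document is parsed with lxml.html (text_content() is provided by lxml.html elements). The SAMPLE markup and the bold_texts helper are hypothetical, made up here to mirror the snippet in the question:

```python
import lxml.html

# Hypothetical sample markup modelled on the <b> element from the question.
SAMPLE = """
<table>
  <tr><td align="right"><b> first-half <br>
      second-half </b></td></tr>
</table>
"""


def bold_texts(html):
    """Return the full text of each matching <b>, with whitespace collapsed."""
    doc = lxml.html.fromstring(html)
    results = []
    for x in doc.xpath('//table//tr/td[@align="right"]/b'):
        # x.text would stop at the <br>; text_content() also picks up the
        # text that follows it (stored as the tail of the <br> element).
        results.append(" ".join(x.text_content().split()))
    return results


if __name__ == "__main__":
    print(bold_texts(SAMPLE))  # ['first-half second-half']
```

If the tree was built with plain lxml.etree instead, the elements have no text_content() method; "".join(x.itertext()) should give the same text in that case.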

Anorov answered Sep 21 '22 23:09
