
xpath <p> inside <h3> empty

I started working with XPath in Python 3 and am running into this behaviour, which looks wrong to me. Why does //h3//text() match text inside a span, but not text inside a p, when both are nested in an h3?

>>> from lxml import etree

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]

>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']

Thanks a lot!

Florian asked Jun 17 '26 04:06


1 Answer

Your first XPath correctly returned nothing, because in the parsed tree the h3 element does not contain any text node. You can use etree.tostring() to see what the tree actually contains:

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
b'<html><body><h3/><p>Hallo</p></body></html>'

The HTML parser most likely did this (closed h3 early and left it empty) because a paragraph is not valid inside a heading element, whereas a span is. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?

To keep the p element inside h3, you can try a different parser, for example the BeautifulSoup-based parser that lxml provides:

>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
b'<html><h3><p>Hallo</p></h3></html>'
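
Another option, offered only as a sketch: if the markup happens to be well-formed XML (an assumption that holds for the snippet in the question, but not for HTML in general), lxml's XML parser can be used instead of the HTML parser. It applies no HTML nesting rules, so the p element stays inside h3 and the original XPath finds the text. The variable names below are illustrative.

```python
from lxml import etree

# Assumption: the input is well-formed XML. Real-world HTML often is not,
# in which case etree.fromstring raises XMLSyntaxError.
markup = "<h3><p>Hallo</p></h3>"

# etree.fromstring uses the XML parser, which does not rearrange elements
# according to HTML content rules, so <p> remains a child of <h3>.
root = etree.fromstring(markup)

# Same XPath as in the question, now evaluated against the unmodified tree.
texts = root.xpath("//h3//text()")
print(texts)  # ['Hallo']
```

This only sidesteps the problem for input that parses as XML; for arbitrary HTML, a more lenient parser such as the one shown above is the safer choice.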
har07 answered Jun 18 '26 19:06


