I started working with XPath in Python 3 and am facing this behaviour. It seems very wrong to me. Why does it match the text of a span inside h3, but not the text of a p inside h3?
>>> from lxml import etree
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]
>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']
Thanks a lot!
Your first XPath correctly returned no result because, in the parsed tree, <h3> doesn't contain any text node. You can use etree.tostring() to see the actual content of the tree:
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
'<html><body><h3/><p>Hallo</p></body></html>'
The parser probably did this (closed h3 as an empty element and placed p after it as a sibling) because it considers a paragraph inside a heading tag invalid, while a span inside a heading is valid. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?
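As a quick check of that explanation, here is a minimal sketch that looks up the p element directly instead of going through h3. It assumes lxml's HTML parser restructures the input as in the tostring() output above; the variable names p_text and parent_tag are only illustrative.

```python
from lxml import etree

tree = etree.HTML("<h3><p>Hallo</p></h3>")

# Querying <p> directly does not depend on where the parser placed it.
p_text = tree.xpath('//p//text()')
print(p_text)

# The parent tag of <p> shows where the element ended up after parsing.
# Per the tostring() output above this is expected to be body, not h3,
# though that may vary with the underlying libxml2 version.
parent_tag = tree.xpath('//p')[0].getparent().tag
print(parent_tag)
```

If the parent is not h3, that confirms the heading was closed before the paragraph started, which is why '//h3//text()' finds nothing.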
To keep p elements inside h3, you can try a different parser, for example BeautifulSoup's parser via lxml.html.soupparser:
>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
'<html><h3><p>Hallo</p></h3></html>'
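Another option, sketched here under the assumption that your input is well-formed XML (which real-world HTML often is not), is to use lxml's XML parser. It applies no HTML nesting rules, so the original structure is kept as written.

```python
from lxml import etree

# Assumption: the fragment is well-formed XML; etree.fromstring raises
# XMLSyntaxError otherwise.
fragment = "<h3><p>Hallo</p></h3>"
root = etree.fromstring(fragment)

# The XML parser keeps <p> nested inside <h3>, so the original XPath matches.
texts = root.xpath('//h3//text()')
print(texts)  # ['Hallo']
```

This avoids the extra BeautifulSoup dependency, but it is only suitable when you control the markup or know it is strict XML.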