xpath <p> inside <h3> empty

Question

I started working with XPath in Python 3 and am facing this behaviour, which seems very wrong to me. Why does it match the text of a span inside h3, but not the text of a p inside h3?

>>> from lxml import etree

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]

>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']

Thanks a lot!

har07 · Accepted Answer

Your first XPath correctly returned no result because the <h3> in the parsed tree didn't contain any text node. You can use the tostring() function to see the actual content of the tree:

>>> from lxml import etree
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)  # returns bytes on Python 3
b'<html><body><h3/><p>Hallo</p></body></html>'

The parser probably did this (turned h3 into an empty element and moved p out of it) because it considers a paragraph inside a heading tag invalid, while a span inside a heading is valid. See the related question "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?"
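To compare how the HTML parser treats the two snippets side by side, something like the following may help. It is only a minimal sketch: the helper name inspect_h3 is made up for illustration, and the exact serialized output can vary with the installed lxml/libxml2 version, so no output is shown for the p case.

```python
from lxml import etree


def inspect_h3(markup):
    """Parse markup with lxml's HTML parser.

    Returns the serialized tree (as str) and the text nodes found under <h3>.
    inspect_h3 is a hypothetical helper, not part of lxml.
    """
    tree = etree.HTML(markup)
    # encoding="unicode" makes tostring() return str instead of bytes
    serialized = etree.tostring(tree, encoding="unicode")
    texts = tree.xpath("//h3//text()")
    return serialized, texts


for markup in ("<h3><p>Hallo</p></h3>", "<h3><span>Hallo</span></h3>"):
    serialized, texts = inspect_h3(markup)
    print(markup, "->", serialized, texts)
```

If the serialized tree shows the p outside the h3, an empty result from //h3//text() is exactly what XPath should return.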

To keep p elements inside h3, you can try a different parser, for example BeautifulSoup's parser:

>>> from lxml import etree
>>> from lxml.html import soupparser  # needs BeautifulSoup installed
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)  # returns bytes on Python 3
b'<html><h3><p>Hallo</p></h3></html>'
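Another option, which is not part of the answer above but might be worth a try: if the markup happens to be well-formed XML, lxml's XML parser (etree.fromstring) does no HTML repair, so the nesting is kept as written. A minimal sketch, assuming well-formed input (it raises XMLSyntaxError on typical real-world HTML):

```python
from lxml import etree

markup = "<h3><p>Hallo</p></h3>"

# The XML parser keeps the document structure as-is; it does not apply
# HTML content-model rules, so <p> stays inside <h3>.
root = etree.fromstring(markup)

texts = root.xpath("//h3//text()")
print(texts)  # ['Hallo']
```

This only works for input that parses as XML; for messy HTML, a lenient parser such as soupparser is the safer choice.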


Tags: python, python-3.x, xpath, lxml

Florian
