I am trying to use XPath to query an HTML document parsed with lxml. The document is a plain HTML-only download of the Wikipedia page about Plastic. I parse it with lxml, disabling entity substitution to avoid an error with '®':
from lxml import etree
root = etree.parse("plastic.html",etree.XMLParser(resolve_entities=False))
Then I retrieve the namespace URL:
htmltag = next(root.iter())
nsurl = list(htmltag.nsmap.values())[0]
Now, I would like to use xpath queries on either 'root' or 'htmltag', but I am unable to do so. I have tried different ways, but the following seems to me the most correct form, which yields errors anyway.
root.xpath('//ns:body',namespace={'ns':nsurl})
And this is what I get
XPathResultError: Unknown return type: dict
I am running the commands in an IPython console, but I don't think that is the problem. What am I doing wrong?
To check which version of lxml is installed, run pip show lxml or pip3 show lxml in CMD/PowerShell (Windows) or a terminal (macOS/Linux); the output includes a Version field.
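The installed version can also be read from inside Python. A minimal sketch, using the version tuples that lxml.etree exposes:

```python
from lxml import etree

# Version of the lxml package itself, as a tuple of integers
print(etree.LXML_VERSION)

# Version of the underlying libxml2 C library that lxml is running against
print(etree.LIBXML_VERSION)
```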
lxml is not written in plain Python, because it interfaces with two C libraries: libxml2 and libxslt. Accessing them at the C-level is required for performance reasons.
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
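This is a simple misspelling: the keyword argument is namespaces, not namespace. Since xpath() treats any keyword argument it does not recognize as an XPath variable, the dict passed as namespace is taken as a variable value, which is why the error complains about a dict.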
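A minimal sketch of the corrected call, for illustration. The inline XHTML string is a hypothetical stand-in for the downloaded plastic.html, and it assumes the document declares a default namespace on its root element:

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page: a tiny XHTML document
XHTML = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Plastic</title></head>
  <body><p>Hello</p></body>
</html>"""

root = etree.fromstring(XHTML, etree.XMLParser(resolve_entities=False))

# The default (unprefixed) namespace is stored under the key None in nsmap
nsurl = root.nsmap[None]

# Note the plural keyword "namespaces" and a dict mapping prefix -> URL
bodies = root.xpath('//ns:body', namespaces={'ns': nsurl})
print(bodies[0].tag)  # {http://www.w3.org/1999/xhtml}body
```

The prefix 'ns' is arbitrary; it only needs to match between the XPath expression and the dict.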