I am attempting to use lxml to parse the contents of a .docx document. I understand that lxml replaces namespace prefixes with the actual namespace URI; however, this makes it a real pain to check what kind of element tag I am working with. I would like to be able to do something like
if (someElement.tag == "w:p"):
but since lxml insists on prepending the full namespace, I'd either have to do something like
if (someElement.tag == "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p"):
or perform a lookup of the full namespace name from the element's nsmap attribute like this
targetTag = "{%s}p" % someElement.nsmap['w']
if (someElement.tag == targetTag):
If there were an easier way to convince lxml to either
This would save a lot of keystrokes when writing this parser. Is this possible? Am I missing something in the documentation?
lxml.etree only returns real Elements, i.e. tree nodes that have a string tag name. Without a filter, both libraries iterate over all nodes. Note that currently only lxml.etree supports passing the Element factory function as filter to select only Elements.
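As a rough illustration of the Element filter mentioned above, the sketch below iterates over a small made-up document twice, once without a filter and once passing the Element factory. The sample XML and variable names are invented for the example.

```python
import lxml.etree as ET

# Made-up document containing a comment and a processing instruction
# next to two ordinary elements.
root = ET.fromstring("<root><!-- a comment --><a/><?pi data?><b/></root>")

# Without a filter, iteration visits every node, including the comment
# and the processing instruction (whose .tag is not a string).
all_nodes = list(root.iter())

# Passing the Element factory as the filter keeps only real Elements.
elements_only = list(root.iter(tag=ET.Element))
tags = [el.tag for el in elements_only]

print(len(all_nodes))
print(tags)
```

This matters when comparing tags, because a comparison such as el.tag == "..." only makes sense for nodes whose tag is a string.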
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml is a Python library that allows for easy handling of XML and HTML files, and it can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is when the lxml library comes into play.
Perhaps use local-name():
import lxml.etree as ET
tree = ET.fromstring('<root xmlns:f="foo"><f:test/></root>')
elt = tree[0]
print(elt.xpath('local-name()'))
# test
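Two other options may also be worth a try; this is a sketch rather than a definitive recipe. etree.QName splits a tag into its namespace and local name without going through XPath, and a small helper can expand a prefixed name such as "w:p" into the form lxml uses. The helper qn and the NSMAP dictionary are names made up for this example (they are not part of lxml), and the sample document is a stripped-down stand-in for real .docx content.

```python
import lxml.etree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}  # hypothetical prefix table for this example


def qn(prefixed, nsmap=NSMAP):
    """Expand 'w:p' into '{namespace-uri}p' (hypothetical helper, not an lxml API)."""
    prefix, local = prefixed.split(":", 1)
    return "{%s}%s" % (nsmap[prefix], local)


# Minimal stand-in for a WordprocessingML document.
root = ET.fromstring(
    '<w:document xmlns:w="%s"><w:body><w:p/></w:body></w:document>' % W_NS
)
para = root[0][0]

# QName exposes the two halves of the qualified tag.
qname = ET.QName(para)
local_name = qname.localname
namespace = qname.namespace

# Compare against the expanded form of the prefixed name.
is_paragraph = para.tag == qn("w:p")

# Rebuild the prefixed spelling from the element's own prefix.
prefixed_name = "%s:%s" % (para.prefix, local_name)

print(local_name, is_paragraph, prefixed_name)
```

With a helper like this, the check from the question becomes if someElement.tag == qn("w:p"):, which keeps the prefix readable while still comparing full namespace URIs.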