I have an XML document that I am trying to parse using lxml.etree:
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
My code is:
    from lxml import etree as ET

    path = "path to xml file"
    parser = ET.XMLParser(ns_clean=True)
    dom = ET.parse(path, parser)
    dom.getroot()
When I try to get dom.getroot() I get:
<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However I only want:
<Element Envelope at 28adacac>
When I do
dom.getroot().find("Body")
I get nothing returned. However, when I do
dom.getroot().find("{http://www.example.com/zzz/yyy}Body")
I get a result.
I thought passing ns_clean=True to the parser would prevent this.
Any ideas?
lxml.etree supports parsing XML in a number of ways and from all important sources, namely strings, files, URLs (http/ftp) and file-like objects. The main parse functions are fromstring() and parse(), both called with the source as first argument.
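As a minimal sketch of the two entry points, the snippet below parses the same document once with fromstring() and once with parse(). The inline document is a shortened version of the one in the question, and the variable names are only illustrative.

```python
import io
from lxml import etree as ET

# Shortened sample document (an assumption for illustration).
xml_bytes = (
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b'<Body>some stuff</Body>'
    b'</Envelope>'
)

# fromstring() takes the document itself and returns the root Element.
root = ET.fromstring(xml_bytes)

# parse() takes a filename, URL or file-like object and returns an ElementTree.
tree = ET.parse(io.BytesIO(xml_bytes))

print(root.tag)            # {http://www.example.com/zzz/yyy}Envelope
print(tree.getroot().tag)  # {http://www.example.com/zzz/yyy}Envelope
```

Either way the tag carries the namespace in {uri}name form, which is why an unqualified find("Body") comes back empty.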
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
In lxml.objectify, element trees provide an API that models the behaviour of normal Python object trees as closely as possible.
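A small sketch of what that looks like for the document in the question, assuming lxml.objectify is acceptable for your use case. Child lookups by attribute use the namespace of the parent element, so no namespace prefix is needed here. The sample document is an abbreviated copy of the one above.

```python
from lxml import objectify

# Abbreviated sample document (an assumption for illustration).
xml_bytes = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

root = objectify.fromstring(xml_bytes)

# Attribute access looks up children in the parent's namespace.
version = root.Header.Version
body = root.Body

print(int(version))  # 1
print(body.text)     # some stuff
print(body.tag)      # {http://www.example.com/zzz/yyy}Body
```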
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
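For the event-driven style, one option is iterparse(), sketched below on an abbreviated copy of the sample document. It yields each element as its closing tag is read; ET.QName is used here only to print the local name without the namespace.

```python
import io
from lxml import etree as ET

# Abbreviated sample document (an assumption for illustration).
xml_bytes = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

local_names = []
# "end" events fire when an element has been fully parsed.
for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    local_names.append(ET.QName(element).localname)

print(local_names)  # ['Version', 'Header', 'Body', 'Envelope']
```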
    import io
    import lxml.etree as ET

    # Bytes literal: io.BytesIO() requires bytes on Python 3.
    content = b'''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
    '''
    dom = ET.parse(io.BytesIO(content))
You can find namespace-aware nodes using the xpath method:
    body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
    print(body)
    # [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
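Note that ns_clean=True only asks the parser to clean up redundant namespace declarations; it does not take elements out of their namespace. If you would rather keep using find(), the following sketch shows two alternatives: passing a prefix mapping (the prefix "ns" is an arbitrary choice) or, in reasonably recent lxml versions, the {*} wildcard that matches any namespace.

```python
from lxml import etree as ET

# Prefix mapping; "ns" is an arbitrary name chosen for this example.
NS = {"ns": "http://www.example.com/zzz/yyy"}

root = ET.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b'<Header><Version>1</Version></Header>'
    b'<Body>some stuff</Body>'
    b'</Envelope>'
)

# find() accepts the same kind of prefix mapping as xpath().
body = root.find("ns:Body", namespaces=NS)
print(body.text)  # some stuff

# {*} matches the local name in any namespace.
any_ns_body = root.find("{*}Body")
print(any_ns_body.tag)  # {http://www.example.com/zzz/yyy}Body
```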
If you really want to remove namespaces, you could use an XSL transformation:
    # http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
    # Bytes literal: io.BytesIO() requires bytes on Python 3.
    xslt = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no"/>

    <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy>
          <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:element name="{local-name()}">
          <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
    </xsl:template>
    </xsl:stylesheet>
    '''

    xslt_doc = ET.parse(io.BytesIO(xslt))
    transform = ET.XSLT(xslt_doc)
    dom = transform(dom)
Here we see the namespace has been removed:
    print(ET.tostring(dom).decode())
    # <Envelope>
    #   <Header>
    #     <Version>1</Version>
    #   </Header>
    #   <Body>
    #     some stuff
    #   </Body>
    # </Envelope>
So you can now find the Body node this way:
    print(dom.find("Body"))
    # <Element Body at 8506cd4>
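If pulling in XSLT feels heavy, a similar effect can probably be had in plain Python by rewriting each element's tag to its local name. This is only a sketch under the assumption that dropping namespaces is safe for your data: it leaves namespaced attributes untouched, and elements from different namespaces with the same local name become indistinguishable.

```python
import io
from lxml import etree as ET

content = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
"""
dom = ET.parse(io.BytesIO(content))

for element in dom.getroot().iter():
    # Comments and processing instructions have a non-string tag; skip them.
    if isinstance(element.tag, str):
        element.tag = ET.QName(element).localname

# Remove namespace declarations that are no longer used by any element.
ET.cleanup_namespaces(dom)

print(dom.getroot().tag)             # Envelope
print(dom.find("Body") is not None)  # True
```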