lxml etree xmlparser remove unwanted namespace

I have an XML document that I am trying to parse using lxml.etree:

    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>

My code is:

    from lxml import etree as ET

    path = "path to xml file"
    parser = ET.XMLParser(ns_clean=True)
    dom = ET.parse(path, parser)
    dom.getroot()

When I try to get dom.getroot() I get:

<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac> 

However I only want:

<Element Envelope at 28adacac> 

When I do

dom.getroot().find("Body") 

I get nothing returned. However, when I

dom.getroot().find("{http://www.example.com/zzz/yyy}Body")  

I get a result.

I thought passing ns_clean=True to the parser would prevent this.

Any ideas?

Mark asked Nov 23 '10 10:11


1 Answer

    import io
    import lxml.etree as ET

    # Bytes literal: io.BytesIO requires bytes, not str, on Python 3
    content = b'''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
    '''

    dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

    body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
    print(body)
    # [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
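As for the question's `ns_clean=True`: that parser option only removes redundant namespace declarations; it does not change element names, so tags stay namespace-qualified. If the goal is simply to avoid typing the full `{uri}` prefix, a minimal sketch along these lines may be enough. The prefix `ns` and the variable names `NS`, `xml`, and `local_name` are arbitrary choices made for illustration, and the sample document is a condensed copy of the one above.

```python
import io
import lxml.etree as ET

# The prefix "ns" is arbitrary; only the URI has to match the document.
NS = {"ns": "http://www.example.com/zzz/yyy"}

xml = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

root = ET.parse(io.BytesIO(xml)).getroot()

# find() accepts the same prefix mapping that xpath() does.
body = root.find("ns:Body", namespaces=NS)

# QName splits a qualified tag into its namespace URI and local name.
local_name = ET.QName(body).localname
print(local_name)
```

This leaves the tree untouched, which is usually preferable when the document has to be written back out with its namespace intact.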

If you really want to remove namespaces, you could use an XSL transformation:

    # http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
    # Bytes literal: io.BytesIO requires bytes, not str, on Python 3
    xslt = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no"/>

    <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy>
          <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:element name="{local-name()}">
          <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
    </xsl:template>
    </xsl:stylesheet>
    '''

    xslt_doc = ET.parse(io.BytesIO(xslt))
    transform = ET.XSLT(xslt_doc)
    dom = transform(dom)

Here we see the namespace has been removed:

    # tostring() returns bytes on Python 3, so decode before printing
    print(ET.tostring(dom).decode())
    # <Envelope>
    #   <Header>
    #     <Version>1</Version>
    #   </Header>
    #   <Body>
    #     some stuff
    #   </Body>
    # </Envelope>

So you can now find the Body node this way:

print(dom.find("Body")) # <Element Body at 8506cd4> 
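A possible alternative to the XSLT route, sketched here under the assumption that only element names need to lose their namespace, is to rewrite the tags in place. The helper name `strip_namespaces` is hypothetical and not part of lxml; it relies on `ET.QName` and `ET.cleanup_namespaces`, and it does not touch namespaced attributes.

```python
import io
import lxml.etree as ET


def strip_namespaces(tree):
    """Replace each element's qualified tag with its local name, in place."""
    for elem in tree.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(elem.tag, str):
            elem.tag = ET.QName(elem).localname
    # Drop namespace declarations that are no longer used by any element.
    ET.cleanup_namespaces(tree)
    return tree


xml = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

stripped = strip_namespaces(ET.parse(io.BytesIO(xml)))
stripped_root = stripped.getroot()
print(stripped_root.tag)
print(stripped_root.find("Body") is not None)
```

Because this mutates the parsed tree, two elements that differed only by namespace would end up with the same name, so it is best suited to documents with a single default namespace like the one in the question.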

unutbu answered Sep 22 '22 12:09