Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: namespaces in xml ElementTree (or lxml)

I want to retrieve a legacy xml file, manipulate and save it.

Here is my code:

from xml.etree import cElementTree as ET
NS = "{http://www.somedomain.com/XI/Traffic/10}"

def fix_xml(filename):
    f = ET.parse(filename)
    root = f.getroot()
    eventlist = root.findall("%(ns)Event" % {'ns':NS })
    xpath = "%(ns)sEventDetail/%(ns)sEventDescription" % {'ns':NS }
    for event in eventlist:
        desc = event.find(xpath)
        desc.text = desc.text.upper() # do some editting to the text.

    ET.ElementTree(root, nsmap=NS).write("out.xml", encoding="utf-8")


shorten_xml("test.xml")

The file I load contains:

xmlns="http://www.somedomain.com/XI/Traffic/10"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.somedomain.com/XI/Traffic/10 10.xds"

at the root tag.

I have the following problems, related to namespace:

  • As you see, for each tag call, I have give the namespace at the begining to retreive a child.
  • Generated xml file doesn't have <?xml version="1.0" encoding="utf-8"?> at the begining.
  • The tags at the output contains such <ns0:eventDescription> while I need output as the original <eventDescription>, without namespace at the begining.

How can these be solved?

like image 974
Hellnar Avatar asked Feb 03 '11 12:02

Hellnar


People also ask

What is lxml used for in Python?

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.

What is Python ElementTree?

ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document in a tree structure that is easy to work with.

Is lxml included in Python?

lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.

Is lxml secure?

Is lxml safe to use? The python package lxml was scanned for known vulnerabilities and missing license, and no issues were found. Thus the package was deemed as safe to use.


2 Answers

Have a look at the lxml tutorial section on namespaces. Also this article about namespaces in ElementTree.

Problem 1: Put up with it, like everybody else does. Instead of "%(ns)Event" % {'ns':NS } try NS+"Event".

Problem 2: By default, the XML declaration is written only if it is required. You can force it (lxml only) by using xml_declaration=True in your write() call.

Problem 3: The nsmap arg appears to be lxml-only. AFAICT it needs a MAPping, not a string. Try nsmap={None: NS}. The effbot article has a section describing a workaround for this.

like image 64
John Machin Avatar answered Oct 16 '22 00:10

John Machin


To answer your questions in order:

  • you can't just ignore the namespace, not in the path syntax that .findall() uses , but not in "real" xpath (supported by lxml) either: there you'd still be forced to use a prefix, and still need to provide some prefix-to-uri mapping.

  • use xml_declaration=True as well as encoding='utf-8' with the .write() call (available in lxml, but in stdlib xml.etree only since python 2.7 I believe)

  • I believe lxml will do behave like you want

like image 36
Steven Avatar answered Oct 16 '22 02:10

Steven