I want to retrieve a legacy XML file, manipulate it, and save it.
Here is my code:
from xml.etree import cElementTree as ET
NS = "{http://www.somedomain.com/XI/Traffic/10}"
def fix_xml(filename):
    f = ET.parse(filename)
    root = f.getroot()
    eventlist = root.findall("%(ns)Event" % {'ns':NS })
    xpath = "%(ns)sEventDetail/%(ns)sEventDescription" % {'ns':NS }
    for event in eventlist:
        desc = event.find(xpath)
        desc.text = desc.text.upper() # do some editing to the text.
    ET.ElementTree(root, nsmap=NS).write("out.xml", encoding="utf-8")
fix_xml("test.xml")
The file I load contains:
xmlns="http://www.somedomain.com/XI/Traffic/10"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.somedomain.com/XI/Traffic/10 10.xds"
at the root tag.
I have the following problems, related to namespaces:
The XML declaration <?xml version="1.0" encoding="utf-8"?> is not written at the beginning of the output.
Tags in the output look like <ns0:eventDescription>, while I need them as in the original, <eventDescription>, without the namespace prefix at the beginning.
How can these be solved?
Have a look at the lxml tutorial section on namespaces. Also this article about namespaces in ElementTree.
Problem 1: Put up with it, like everybody else does. Instead of "%(ns)Event" % {'ns':NS } try NS+"Event".
Problem 2: By default, the XML declaration is written only if it is required. You can force it by using xml_declaration=True in your write() call (supported by lxml, and by the standard library's ElementTree since Python 2.7).
Problem 3: The nsmap arg appears to be lxml-only. AFAICT it needs a MAPping, not a string. Try nsmap={None: NS} (with the bare namespace URI as the value, without the surrounding curly braces). The effbot article has a section describing a workaround for this.
To answer your questions in order:
You can't just ignore the namespace: not in the path syntax that .findall() uses, and not in "real" XPath (supported by lxml) either. There you'd still be forced to use a prefix, and still need to provide some prefix-to-URI mapping.
Use xml_declaration=True as well as encoding='utf-8' with the .write() call (available in lxml, but in stdlib xml.etree only since Python 2.7, I believe).
I believe lxml will behave the way you want.
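To illustrate the prefix-to-URI mapping mentioned in the first point, here is a small sketch using the standard library's findall(), which accepts such a mapping as its second argument. The prefix "t" and the sample document are arbitrary choices of mine, not anything from the original file. The prefix only has to be consistent between the path and the mapping, and it does not need to appear in the document itself.

```python
import xml.etree.ElementTree as ET

URI = "http://www.somedomain.com/XI/Traffic/10"

# Hypothetical prefix-to-URI mapping; "t" is an arbitrary name.
PREFIXES = {"t": URI}

# Hypothetical sample document that uses a default namespace, as in the question.
SAMPLE = (
    '<Events xmlns="%s">'
    '<Event><EventDetail><EventDescription>road closed</EventDescription></EventDetail></Event>'
    '<Event><EventDetail><EventDescription>lane blocked</EventDescription></EventDetail></Event>'
    '</Events>' % URI
)

root = ET.fromstring(SAMPLE)

# Every step of the path still carries a prefix; the mapping resolves it to the URI.
descriptions = root.findall("t:Event/t:EventDetail/t:EventDescription", PREFIXES)
texts = [d.text for d in descriptions]
```

The same mapping idea applies to lxml's xpath() method, which takes it through a namespaces keyword argument.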