Python's built-in xml.etree package supports parsing XML files with namespaces, but namespace prefixes get expanded to the full URI enclosed in curly braces. So in the example file in the official documentation:
<actors xmlns:fictional="http://characters.example.com"
xmlns="http://people.example.com">
<actor>
<name>John Cleese</name>
<fictional:character>Lancelot</fictional:character>
<fictional:character>Archie Leach</fictional:character>
</actor>
...
The actor tag gets expanded to {http://people.example.com}actor and fictional:character to {http://characters.example.com}character.
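To make the expansion concrete, here is a minimal sketch that parses the documentation example with the standard library and prints the tags as ElementTree stores them. The closing </actors> tag is added here only so the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

xml_text = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
</actors>"""

root = ET.fromstring(xml_text)

# Each tag is stored as "{namespace-uri}local-name"; the prefix is not kept.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)
```

Note that elements without a prefix, such as actor, still get the default namespace URI prepended.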
I can see how this makes everything very explicit and reduces ambiguity (the file could have the same namespace with a different prefix, etc.), but it is very cumbersome to work with. The Element.find() method and others allow passing a dict mapping prefixes to namespace URIs, so I can still do element.find('fictional:character', nsmap), but to my knowledge there is nothing similar for attributes. This leads to annoying stuff like element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].
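A small sketch of the asymmetry described above, using only the standard library. The fictional:era attribute and the clark() helper are assumptions made up for illustration; they are not part of the documentation example or of ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the fictional:era attribute is invented for this sketch.
xml_text = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor fictional:era="medieval">
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
</actors>"""

# The prefixes here are chosen in code and need not match the document's.
nsmap = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}


def clark(prefixed_name, nsmap):
    """Hypothetical helper: turn 'prefix:local' into '{uri}local'."""
    prefix, _, local = prefixed_name.partition(":")
    return "{%s}%s" % (nsmap[prefix], local)


root = ET.fromstring(xml_text)

# Element lookups accept the prefix map directly...
actor = root.find("people:actor", nsmap)
character = actor.find("fictional:character", nsmap)

# ...but attribute lookups need the expanded name to be built by hand.
era = actor.attrib[clark("fictional:era", nsmap)]
print(character.text, era)
```

A helper like this at least hides the string formatting, but it still has to be called explicitly for every attribute access.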
The popular lxml package provides the same API with a few extensions, one of which is an nsmap property on the elements that they inherit from the document. However, none of the methods seem to actually make use of it, so I still have to do element.find('fictional:character', element.nsmap), which is just unnecessarily repetitive to type out every time. It also still doesn't work with attributes.
Luckily lxml supports subclassing ElementBase, so I just made a subclass with a p (for prefix) property that has the same API but automatically resolves namespace prefixes using the element's nsmap (Edit: it is likely best to assign a custom nsmap defined in code). So I just do element.p.find('fictional:character') or element.p.attrib['prefix:attrname'], which is much less repetitive and I think way more readable.
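For readers who want to see the general shape of such a convenience layer, here is a rough sketch. It is not the code described above: it swaps in a plain wrapper class around standard-library elements rather than an lxml element subclass, and the class name Prefixed, its methods, and the fictional:era attribute are all hypothetical.

```python
import xml.etree.ElementTree as ET


class Prefixed:
    """Hypothetical wrapper that resolves 'prefix:name' with a fixed nsmap."""

    def __init__(self, element, nsmap):
        self.element = element
        self.nsmap = nsmap

    def _clark(self, name):
        # 'prefix:local' -> '{uri}local'; names without a prefix pass through.
        prefix, sep, local = name.partition(":")
        if not sep:
            return name
        return "{%s}%s" % (self.nsmap[prefix], local)

    def find(self, path):
        found = self.element.find(path, self.nsmap)
        return None if found is None else Prefixed(found, self.nsmap)

    def findall(self, path):
        return [Prefixed(e, self.nsmap)
                for e in self.element.findall(path, self.nsmap)]

    def get(self, name, default=None):
        # Attribute access with a prefixed name.
        return self.element.get(self._clark(name), default)

    @property
    def text(self):
        return self.element.text


# The nsmap is defined in code, independent of the document's prefixes.
NSMAP = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}

xml_text = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor fictional:era="medieval">
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
</actors>"""

root = Prefixed(ET.fromstring(xml_text), NSMAP)
actor = root.find("people:actor")
era = actor.get("fictional:era")
characters = [c.text for c in actor.findall("fictional:character")]
print(era, characters)
```

The wrapper carries its own prefix table, so nothing on the underlying document is modified.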
I just feel like I'm missing something, though. It really feels like this should already be a feature of lxml, if not of the built-in etree package. Am I somehow doing this wrong?
Do you need to pass it as a parameter into each function call? An option would be to set the prefixes to be used as a property on the XML document.
That's fine until you pass the XML document into a 3rd party function. That function wants to use prefixes as well, so it sets the property to something else, because it does not know what you set it to.
When you get the XML document back, it has been modified, so your prefixes don't work any more.
All in all: no, it's not safe and therefore it's good as it is.
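The hazard can be sketched in a few lines. Everything here is hypothetical: SharedPrefixDocument stands in for an imagined API that stores prefixes on the document, and third_party_helper stands in for library code you don't control.

```python
import xml.etree.ElementTree as ET


class SharedPrefixDocument:
    """Hypothetical document type holding one mutable prefix table."""

    def __init__(self, root):
        self.root = root
        self.prefixes = {}

    def find(self, path):
        return self.root.find(path, self.prefixes)


def third_party_helper(doc):
    # The library cannot know the caller's prefixes, so it installs its own.
    doc.prefixes = {"p": "http://people.example.com"}
    return doc.find("p:actor")


root = ET.fromstring(
    '<actors xmlns="http://people.example.com"><actor/></actors>'
)
doc = SharedPrefixDocument(root)
doc.prefixes = {"people": "http://people.example.com"}

before = doc.find("people:actor") is not None  # works with my prefixes
third_party_helper(doc)                        # silently replaces them

try:
    doc.find("people:actor")
    broken = False
except SyntaxError:
    # ElementTree raises SyntaxError for a prefix missing from the map.
    broken = True

# Passing the map per call is unaffected by what the helper did.
per_call = root.find(
    "people:actor", {"people": "http://people.example.com"}
) is not None
print(before, broken, per_call)
```

With the map passed per call, each caller's prefixes stay private to that call.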
This design does not only exist in Python, it also exists in .NET. The SelectNodes() [MSDN] method can be used if you don't need prefixes, but as soon as a prefix is present, it throws an exception. Therefore, you have to use the overloaded SelectNodes() [MSDN], which takes an XmlNamespaceManager as a parameter.
I suggest learning XPath (lxml specific link), where you can use prefixes. Since this may be version specific, let me say I ran this code with Python 2.7 x64 and lxml 3.6.0 (I'm not too familiar with Python, so this may not be the cleanest code, but it serves well as a demonstration):
from lxml import etree as ET
from pprint import pprint
data = """<?xml version="1.0"?>
<d:data xmlns:d="dns">
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor d:name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
</d:data>"""
root = ET.fromstring(data)
my_namespaces = {'x':'dns'}
xp=root.xpath("/x:data/country/neighbor/@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("//@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("/x:data/country/neighbor/@name", namespaces=my_namespaces)
pprint(xp)
The output is
C:\Python27x64\python.exe E:/xpath.py
['Austria']
['Austria']
['Switzerland', 'Malaysia']
Process finished with exit code 0
Note how well XPath solved the mapping from the x prefix in the namespace table to the d prefix in the XML document.
This eliminates the really awful to read element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].
To select an element, write /element, optionally with a prefix.
xp=root.xpath("/x:data", namespaces=my_namespaces)
To select an attribute, write /@attribute, optionally with a prefix.
#See example above
To navigate down, concatenate several elements. Use // if you don't know the items in between. Use /.. to move up. An attribute step must come last unless /.. follows it.
xp=root.xpath("/x:data/country/neighbor/@x:name/..", namespaces=my_namespaces)
To use a condition, write it in square brackets.
/element[@attribute] means: select all elements that have this attribute.
/element[@attribute='value'] means: select all elements that have this attribute and the attribute has a specific value.
/element[./subelement] means: select all elements that have a subelement with a specific name.
Optionally use prefixes anywhere.
xp=root.xpath("/x:data/country[./neighbor[@name='Switzerland']]/@name", namespaces=my_namespaces)
There's much more to discover, like text(), various ways of sibling selection and even functions.
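As a rough illustration of those three features, here is a short sketch with lxml on a trimmed copy of the data from above. The particular queries are just examples I picked; I have not checked them against lxml 3.6.0 specifically.

```python
from lxml import etree

data = """<d:data xmlns:d="dns">
  <country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <neighbor d:name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <year>2011</year>
    <neighbor name="Malaysia" direction="N"/>
  </country>
</d:data>"""

root = etree.fromstring(data)
my_namespaces = {"x": "dns"}

# text(): select the text content of every rank element.
ranks = root.xpath("/x:data/country/rank/text()", namespaces=my_namespaces)

# Sibling selection: everything that follows rank inside Singapore.
siblings = root.xpath(
    "/x:data/country[@name='Singapore']/rank/following-sibling::*",
    namespaces=my_namespaces,
)
sibling_tags = [element.tag for element in siblings]

# Functions: count() returns a number (a float in lxml) instead of a list.
neighbor_count = root.xpath("count(//neighbor)")

print(ranks, sibling_tags, neighbor_count)
```

The return type depends on the expression: node selections give lists, while functions like count() give a single value.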
The original question title was:
Why does working with XML namespaces seem so difficult in Python?
Some users simply don't understand the concept. If the user does understand the concept, maybe the developer didn't. Or perhaps this design was just one option out of many, and the decision was made to go in that direction. The only person who could answer the "why" part in such a case is the developer.
If you need to avoid repeating nsmap parameters using ElementTree in Python, consider transforming your XML with XSLT to remove namespaces and return local element names. Python's lxml can run XSLT 1.0 scripts.
For background, XSLT is a special-purpose declarative language (in the same family as XPath, but operating on whole documents) used specifically to transform XML sources. In fact, XSLT scripts are well-formed XML documents! Removing namespaces is a commonly needed task.
Consider the following with XML and XSLT embedded as strings (but each can be parsed from file). Once transformed, you can run .findall(), .iter(), and .xpath() on the new tree object without needing to define namespace prefixes:
Script
import lxml.etree as ET
# LOAD XML AND XSL STRINGS
xmlStr = '''
<actors xmlns:fictional="http://characters.example.com"
xmlns="http://people.example.com">
<actor>
<name>John Cleese</name>
<fictional:character>Lancelot</fictional:character>
<fictional:character>Archie Leach</fictional:character>
</actor>
</actors>
'''
dom = ET.fromstring(xmlStr)
xslStr = '''
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:element name="{local-name()}">
<xsl:apply-templates select="@*|node()"/>
</xsl:element>
</xsl:template>
<xsl:template match="text()">
<xsl:copy/>
</xsl:template>
</xsl:transform>
'''
xslt = ET.fromstring(xslStr)
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT AND PARSE
print(str(newdom))
# findall() uses ElementPath, so the search path is written relative to the root
for i in newdom.findall('.//character'):
    print(i.text)
for i in newdom.iter('character'):
    print(i.text)
for i in newdom.xpath('//character'):
    print(i.text)
Output
<?xml version="1.0"?>
<actors>
<actor>
<name>John Cleese</name>
<character>Lancelot</character>
<character>Archie Leach</character>
</actor>
</actors>
Lancelot
Archie Leach
Lancelot
Archie Leach
Lancelot
Archie Leach