Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I make my code more readable and DRYer when working with XML namespaces in Python?

Python's built-in xml.etree package supports parsing XML files with namespaces, but namespace prefixes get expanded to the full URI enclosed in brackets. So in the example file in the official documentation:

<actors xmlns:fictional="http://characters.example.com"
    xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    ...

The actor tag gets expanded to {http://people.example.com}actor and fictional:character to {http://characters.example.com}character.

I can see how this makes everything very explicit and reduces ambiguity (the file could have the same namespace with a different prefix, etc.) but it is very cumbersome to work with. The Element.find() method and others allow passing a dict mapping prefixes to namespace URIs so I can still do element.find('fictional:character', nsmap) but to my knowledge there is nothing similar for tag attributes. This leads to annoying stuff like element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].

The popular lxml package provides the same API with a few extensions, one of which is an nsmap property on the elements that they inherit from the document. However none of the methods seem to actually make use of it, so I still have to do element.find('fictional:character', element.nsmap) which is just unnecessarily repetitive to type out every time. It also still doesn't work with attributes.

Luckily lxml supports subclassing BaseElement, so I just made one with a p (for prefix) property that has the same API but automatically uses namespace prefixes using the element's nsmap (Edit: likely best to assign a custom nsmap defined in code). So I just do element.p.find('fictional:character') or element.p.attrib['prefix:attrname'], which much less repetitive and I think way more readable.

I just feel like I'm really missing something though, it really feels like this should really already be a feature of lxml if not the builtin etree package. Am I somehow doing this wrong?

like image 206
JaredL Avatar asked May 16 '16 21:05

JaredL


2 Answers

Is it possible to get rid of the namespace mapping?

Do you need to pass it as a parameter into each function call? An option would be to set the prefixes to be used at the XML document in a property.

That's fine until you pass the XML document into a 3rd party function. That function wants to use prefixes as well, so it sets the property to something else, because it does not know what you set it to.

As soon as you get the XML document back, it was modified, so your prefixes don't work any more.

All in all: no, it's not safe and therefore it's good as it is.

This design does not only exist in Python, it also exists in .NET. The SelectNodes() [MSDN] can be used if you don't need prefixes. But as soon as there's a prefix present, it'll throw an exception. Therefore, you have to use the overloaded SelectNodes() [MSDN] which uses an XmlNamespaceManager as a parameter.

XPath as a solution

I suggest to learn XPath (lxml specific link), where you can use prefixes. Since this may be version specific, let me say I ran this code with Python 2.7 x64 and lxml 3.6.0 (I'm not too familiar with Python, so this may not be the cleanest code, but it serves well as a demonstration):

from lxml import etree as ET
from pprint import pprint
data = """<?xml version="1.0"?>
<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""
root = ET.fromstring(data)
my_namespaces = {'x':'dns'}
xp=root.xpath("/x:data/country/neighbor/@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("//@x:name", namespaces=my_namespaces)
pprint(xp)
xp=root.xpath("/x:data/country/neighbor/@name", namespaces=my_namespaces)
pprint(xp)

The output is

C:\Python27x64\python.exe E:/xpath.py
['Austria']
['Austria']
['Switzerland', 'Malaysia']

Process finished with exit code 0

Note how well XPath solved the mapping from x prefix in the namespace table to the d prefix in the XML document.

This eliminates the really awful to read element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].

Short (and incomplete) XPath introduction

To select an element, write /element, optionally use a prefix.

xp=root.xpath("/x:data", namespaces=my_namespaces)

To select an attribute, write /@attribute, optionally use a prefix.

#See example above

To navigate down, concatenate several elements. Use // if you don't know items in between. To move up, use /... Attributes must be last if not followed by /...

xp=root.xpath("/x:data/country/neighbor/@x:name/..", namespaces=my_namespaces)

To use a condition, write it in square brackets. /element[@attribute] means: select all elements that have this attribute. /element[@attribute='value'] means: select all elements that have this attribute and the attribute has a specific value. /element[./subelement] means: select all elements that have a subelement with a specific name. Optionally use prefixes anywhere.

xp=root.xpath("/x:data/country[./neighbor[@name='Switzerland']]/@name", namespaces=my_namespaces)

There's much more to discover, like text(), various ways of sibling selection and even functions.

About the 'why'

The original question title which was

Why does working with XML namespaces seem so difficult in Python?

For some users, they just don't understand the concept. If the user understands the concept, maybe the developer didn't. And perhaps it was just one option out of many and the decision was to go that direction. The only person who could give an answer on the "why" part in such a case would be the developer himself.

References

  • XPath dictionary entry [Wikipedia]
  • XPath specification [W3C]
  • XPath tutorial [W3Schools]
like image 97
Thomas Weller Avatar answered Sep 21 '22 22:09

Thomas Weller


If you need to avoid repeating nsmap parameters using ElementTree in Python, consider transforming your XML with XSLT to remove namespaces and return local element names. And Python's lxml can run XSLT 1.0 scripts.

As information, XSLT is a special-purpose declarative language (same family as XPath but interacts with whole documents) used specifically to transform XML sources. In fact, XSLT scripts are well-formed XML documents! And removing namespaces is an often used task for end user needs.

Consider the following with XML and XSLT embedded as strings (but each can be parsed from file). Once transformed, you can run .findall(), iter(), and .xpath() on the transformed new tree object without need of defining namespace prefixes:

Script

import lxml.etree as ET

# LOAD XML AND XSL STRINGS
xmlStr = '''
         <actors xmlns:fictional="http://characters.example.com"
                 xmlns="http://people.example.com">
             <actor>
                 <name>John Cleese</name>
                 <fictional:character>Lancelot</fictional:character>
                 <fictional:character>Archie Leach</fictional:character>
             </actor>
         </actors>
         '''
dom = ET.fromstring(xmlStr)

xslStr = '''
        <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
        <xsl:strip-space elements="*"/>

          <xsl:template match="@*|node()">
            <xsl:element name="{local-name()}">
              <xsl:apply-templates select="@*|node()"/>
            </xsl:element>
          </xsl:template>

          <xsl:template match="text()">
            <xsl:copy/>
          </xsl:template>

        </xsl:transform>
        '''
xslt = ET.fromstring(xslStr)

# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)

# OUTPUT AND PARSE
print(str(newdom))

for i in newdom.findall('//character'):
    print(i.text)

for i in newdom.iter('character'):
    print(i.text)

for i in newdom.xpath('//character'):
    print(i.text)

Output

<?xml version="1.0"?>
<actors>
  <actor>
    <name>John Cleese</name>
    <character>Lancelot</character>
    <character>Archie Leach</character>
  </actor>
</actors>

Lancelot
Archie Leach
Lancelot
Archie Leach
Lancelot
Archie Leach
like image 32
Parfait Avatar answered Sep 20 '22 22:09

Parfait