
How do I parse and write XML using Python's ElementTree without moving namespaces around?

Our project receives XML of this form from upstream:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-6.0.0.0" newVersion="7.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
  <appSettings>
    <add key="foo" value="default">
    ...
  </appSettings>
</configuration>

It then parses this XML using ElementTree and, for every app setting matching a certain key ("foo"), writes a new value that it knows about and the upstream process doesn't (in this case, key "foo" should have the value "bar").
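The edit step described above can be sketched roughly as follows. This is only an illustration: the helper name set_app_setting and the shortened sample document are assumptions, not part of the project's actual code.

```python
from xml.etree import ElementTree as ET

def set_app_setting(root, key, value):
    """Set value= on every <appSettings>/<add> whose key= matches; return the count."""
    changed = 0
    for add in root.iterfind("./appSettings/add"):
        if add.get("key") == key:
            add.set("value", value)
            changed += 1
    return changed

# Shortened, hypothetical stand-in for the upstream document.
sample = '''<configuration>
  <appSettings>
    <add key="foo" value="default" />
    <add key="other" value="untouched" />
  </appSettings>
</configuration>'''

root = ET.fromstring(sample)
count = set_app_setting(root, "foo", "bar")
```

The appSettings elements carry no namespace, so plain paths are enough here; the difficulty is entirely in writing the tree back out.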

The downstream process consuming the filtered XML is, aaahhhh... fragile. It expects to receive the XML in exactly the form above.

If I parse this XML without registering a namespace, then ElementTree mangles my tree so that it comes back out like this:

<configuration xmlns:ns0="urn:schemas-microsoft-com:asm.v1">
  <runtime>
  <ns0:assemblyBinding>
    <ns0:dependentAssembly>
      <ns0:assemblyIdentity culture="neutral" name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" />
      <ns0:bindingRedirect newVersion="7.0.0.0" oldVersion="0.0.0.0-6.0.0.0" />
    </ns0:dependentAssembly>
  </ns0:assemblyBinding>
 </runtime>
 <appSettings>
    <add key="foo" value="default">
    ...
 </appSettings>
</configuration>

The downstream process can't handle this, because it's not clever enough to realize that, semantically, this is the same thing. So I decided to register the namespace I know the upstream process will provide as a default namespace, to avoid the prefixes showing up everywhere, and now I get this:

<configuration xmlns="urn:schemas-microsoft-com:asm.v1">
 <runtime>
  <assemblyBinding>
    <dependentAssembly>
      <assemblyIdentity culture="neutral" name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" />
      <bindingRedirect newVersion="7.0.0.0" oldVersion="0.0.0.0-6.0.0.0" />
    </dependentAssembly>
  </assemblyBinding>
 </runtime>
 <appSettings>
    <add key="foo" value="default">
    ...
 </appSettings>
</configuration>

I don't know much about XML, but the downstream component complains about this too. It also seems to me that this default xmlns now applies to every element inside <configuration>, whereas before it applied only to the <assemblyBinding> element and its children. Is that right?
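One way to check that suspicion is to parse both shapes and compare the expanded tag names ElementTree reports. The two documents below are shortened, hypothetical versions of the ones above, and the helper tags is introduced only for this sketch.

```python
from xml.etree import ElementTree as ET

NS = "urn:schemas-microsoft-com:asm.v1"

# Declaration on <assemblyBinding>, as upstream sends it.
original = f'''<configuration>
  <runtime>
    <assemblyBinding xmlns="{NS}" />
  </runtime>
  <appSettings>
    <add key="foo" value="default" />
  </appSettings>
</configuration>'''

# Declaration moved up to the root element.
hoisted = f'''<configuration xmlns="{NS}">
  <runtime>
    <assemblyBinding />
  </runtime>
  <appSettings>
    <add key="foo" value="default" />
  </appSettings>
</configuration>'''

def tags(xml_text):
    """Return the expanded tag name of every element, in document order."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]

original_tags = tags(original)
hoisted_tags = tags(hoisted)
print(original_tags)
print(hoisted_tags)
```

In the first list only assemblyBinding carries the {urn:...} part; in the second every element does, including <add>. So the two documents are not equivalent, and a strict consumer is entitled to reject the second one.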

Is there any way, using ElementTree, to handle this namespace so that I can take in the upstream's XML, set foo's value, and then pass that on downstream, without moving the namespace around, leaving it exactly as I found it?

  • I could use an lxml-based solution, which seems to handle this, however, lxml has a dependency on C which the downstream component would really like not to have to support: a pure Python solution is preferable.

  • I could read the document as HTML which would ignore the namespace attribute, let me manipulate the value I want, and then pass on the document; however, I have yet to find a Python parser that doesn't downcase all the element names, and my downstream component requires the casing on all element names to be preserved.

  • I could resort to string parsing and regular expressions. I would rather not write my own parser.

The only advice I could find so far about namespace handling in ElementTree suggests the "register a default namespace to avoid prefixes" approach, which I assumed would be suitable, but ElementTree then insists on moving the xmlns declaration up to the root node upon dumping.
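A small sketch of why the declaration moves, assuming a shortened version of the upstream document: once parsed, the tree keeps only expanded tag names, so the serializer has no record of where the xmlns attribute originally sat.

```python
from xml.etree import ElementTree as ET

source = '''<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly />
    </assemblyBinding>
  </runtime>
</configuration>'''

root = ET.fromstring(source)

# After parsing, the declaration is gone; only the expanded tag name remains.
binding = root.find("./runtime/*")
print(binding.tag)
print(binding.attrib)

# On output, the serializer collects every namespace in use and declares
# all of them on the root element.
serialized = ET.tostring(root, encoding="unicode")
print(serialized.splitlines()[0])
```

The xmlns declaration never appears in attrib, which is why editing attributes cannot put it back on the original node.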

I could also be clever and build up a string that dumps the tree out in stages, in exactly the right order to put the xmlns declaration back on the "right node", but that also strikes me as pretty darned fragile.

Has anyone managed to get past a problem like this?

Viktor Haag asked Jul 18 '16 14:07



1 Answer

As far as I know, the solution that best suits your needs is to write a custom renderer in pure Python, using the features exposed by xml.etree.ElementTree. Here is one possible solution:

from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, quoteattr
from re import findall, sub

def render(root, buffer='', namespaces=None, level=0, indent_size=2, encoding='utf-8'):
    # Emit the XML declaration once, at the top level only.
    buffer += f'<?xml version="1.0" encoding="{encoding}" ?>\n' if not level else ''
    root = root.getroot() if isinstance(root, ET.ElementTree) else root
    if not level:
        # ET._namespaces is a private helper and may change between Python
        # versions; it returns (qnames, {uri: prefix}) for the whole tree.
        _, namespaces = ET._namespaces(root)
    else:
        # Work on a copy so a declaration made here covers only this subtree.
        namespaces = dict(namespaces)
    indent = ' ' * indent_size * level
    # Tags are written without a prefix, so this only round-trips namespaces
    # registered with the default (empty) prefix.
    tag = sub(r'{[^}]+}', '', root.tag)
    buffer += f'{indent}<{tag}'
    # Declare the namespace on the first element that uses it.
    for ns in findall(r'{[^}]+}', root.tag):
        uri = ns[1:-1]
        if uri not in namespaces: continue
        prefix = namespaces.pop(uri)
        buffer += ' xmlns' + (f':{prefix}' if prefix else '') + f'="{uri}"'
    for k, v in root.attrib.items():
        buffer += f' {k}={quoteattr(v)}'
    buffer += '>' + (escape(root.text.strip()) if root.text else '')
    # Render each child exactly once, one indent level deeper.
    children = list(root)
    for child in children:
        sep = '\n' if buffer[-1] != '\n' else ''
        buffer += sep + render(child, level=level+1, indent_size=indent_size, namespaces=namespaces)
    buffer += f'{indent}</{tag}>\n' if children else f'</{tag}>\n'
    return buffer

Pass the XML data you gave to the render function above, as shown below:

data=\
'''<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-6.0.0.0" newVersion="7.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
  <appSettings>
    <add key="foo" value="default" />
  </appSettings>
</configuration>'''

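e = ET.fromstring(data)
# Register the namespace with an empty prefix so it is written as a default xmlns.
ET.register_namespace('', "urn:schemas-microsoft-com:asm.v1")
r = ET.ElementTree(e)
print(render(r))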

You'll get the following XML, which keeps the xmlns declaration on <assemblyBinding> as you asked (note that empty elements are written as open/close pairs rather than self-closing tags):

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral"></assemblyIdentity>
        <bindingRedirect oldVersion="0.0.0.0-6.0.0.0" newVersion="7.0.0.0"></bindingRedirect>
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
  <appSettings>
    <add key="foo" value="default"></add>
  </appSettings>
</configuration>
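If the downstream consumer also insists on self-closing tags such as <add ... />, one possible post-processing step is sketched below. The function name collapse_empty_elements is hypothetical, and the sketch assumes the rendered text uses double-quoted attributes; pairs quoted any other way are simply left untouched.

```python
import re

# Matches <tag attr="v" ...></tag> with nothing between the open and close tag.
_EMPTY_PAIR = re.compile(r'<([^\s<>/]+)((?:\s+[^\s=<>]+="[^"]*")*)\s*></\1>')

def collapse_empty_elements(xml_text):
    """Rewrite empty open/close pairs as self-closing tags."""
    return _EMPTY_PAIR.sub(r'<\1\2 />', xml_text)

line = '<add key="foo" value="default"></add>'
collapsed = collapse_empty_elements(line)
print(collapsed)
```

This is plain string rewriting applied after rendering, so it is worth checking against real upstream documents before relying on it.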

I know I came late to the party. Anyway, I hope this helps you and anyone else having the same issue. Happy coding!

Giova answered Oct 16 '22 15:10