
Can ElementTree be told to preserve the order of attributes?

I've written a fairly simple filter in Python using ElementTree to munge the contents of some XML files. And it works, more or less.

But it reorders the attributes of various tags, and I'd like it to not do that.

Does anyone know a switch I can throw to make it keep them in specified order?

Context for this

I'm working with and on a particle physics tool that has a complex, but oddly limited, configuration system based on XML files. Among the many things set up that way are the paths to various static data files. These paths are hardcoded into the existing XML, there are no facilities for setting or varying them based on environment variables, and in our local installation they are necessarily in a different place.

This isn't a disaster, because the combined source- and build-control tool we're using allows us to shadow certain files with local copies. But even though the data fields are static, the XML isn't, so I've written a script for fixing the paths. With the attribute rearrangement, though, diffs between the local and master versions are harder to read than necessary.


This is my first time taking ElementTree for a spin (and only my fifth or sixth python project) so maybe I'm just doing it wrong.

Abstracted for simplicity the code looks like this:

    tree = elementtree.ElementTree.parse(inputfile)
    i = tree.getiterator()
    for e in i:
        e.text = filter(e.text)
    tree.write(outputfile)
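For anyone trying the same thing on a current Python 3, here is a minimal runnable sketch of that filter. The `fix_path` helper, the `/master/data` and `/local/data` prefixes, and the sample document are all made up for illustration; they are not from the original script. It uses `iter()`, since `getiterator()` is no longer available in recent versions, and avoids naming the helper `filter` so the builtin isn't shadowed.

```python
import io
import xml.etree.ElementTree as ET


def fix_path(text):
    """Hypothetical filter: swap an assumed master prefix for a local one."""
    if text is None:  # elements without text have text == None
        return None
    return text.replace("/master/data", "/local/data")


def rewrite_paths(source, destination):
    """Parse source, run fix_path over every element's text, write the result."""
    tree = ET.parse(source)
    for elem in tree.iter():  # walks the root and all descendants
        elem.text = fix_path(elem.text)
    tree.write(destination, encoding="unicode")


# Invented sample input, using in-memory files instead of real paths.
src = io.StringIO(
    '<config><file kind="static">/master/data/geom.dat</file></config>'
)
out = io.StringIO()
rewrite_paths(src, out)
result = out.getvalue()
print(result)
```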

Reasonable or dumb?


Related links:

  • How can I get the order of an element attribute list using Python xml.sax?
  • Preserve order of attributes when modifying with minidom
asked Apr 29 '10 23:04 by dmckee --- ex-moderator kitten




2 Answers

With help from @bobince's answer and these two (setting attribute order, overriding module methods), I managed to get this monkey-patched. It's dirty, and I'd suggest using another module that handles this scenario better, but when that isn't a possibility:

    # =======================================================================
    # Monkey patch ElementTree (written against Python 2.7's xml.etree)
    import xml.etree.ElementTree as ET

    def _serialize_xml(write, elem, encoding, qnames, namespaces):
        tag = elem.tag
        text = elem.text
        if tag is ET.Comment:
            write("<!--%s-->" % ET._encode(text, encoding))
        elif tag is ET.ProcessingInstruction:
            write("<?%s?>" % ET._encode(text, encoding))
        else:
            tag = qnames[tag]
            if tag is None:
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
            else:
                write("<" + tag)
                items = elem.items()
                if items or namespaces:
                    if namespaces:
                        for v, k in sorted(namespaces.items(),
                                           key=lambda x: x[1]):  # sort on prefix
                            if k:
                                k = ":" + k
                            write(" xmlns%s=\"%s\"" % (
                                k.encode(encoding),
                                ET._escape_attrib(v, encoding)
                                ))
                    #for k, v in sorted(items):  # lexical order
                    for k, v in items: # Monkey patch
                        if isinstance(k, ET.QName):
                            k = k.text
                        if isinstance(v, ET.QName):
                            v = qnames[v.text]
                        else:
                            v = ET._escape_attrib(v, encoding)
                        write(" %s=\"%s\"" % (qnames[k], v))
                if text or len(elem):
                    write(">")
                    if text:
                        write(ET._escape_cdata(text, encoding))
                    for e in elem:
                        _serialize_xml(write, e, encoding, qnames, None)
                    write("</" + tag + ">")
                else:
                    write(" />")
        if elem.tail:
            write(ET._escape_cdata(elem.tail, encoding))

    ET._serialize_xml = _serialize_xml
    # write() looks the serializer up in the _serialize dict, so patch that too
    ET._serialize["xml"] = _serialize_xml

    from collections import OrderedDict

    class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
        # Same as the stock _start_list, but collects attributes in an OrderedDict
        def _start_list(self, tag, attrib_in):
            fixname = self._fixname
            tag = fixname(tag)
            attrib = OrderedDict()
            if attrib_in:
                for i in range(0, len(attrib_in), 2):
                    attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
            return self._target.start(tag, attrib)

    # =======================================================================

Then in your code:

    tree = ET.parse(pathToFile, OrderedXMLTreeBuilder())
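If you can run on Python 3.8 or later, the patch may not be needed at all: as far as I know, the standard library serializer stopped sorting attributes in 3.8, and plain dicts keep insertion order. A quick check, assuming such an interpreter (the sample element is the one from @bobince's answer):

```python
import xml.etree.ElementTree as ET

# Assumes Python 3.8+; older versions sort attributes on output.
root = ET.fromstring('<a b="c" d="e" f="g" j="k" h="i"/>')

# Attribute names in the order the parser stored them.
names = list(root.attrib)

# Serialize back to text without any monkey patching.
serialized = ET.tostring(root, encoding="unicode")

print(names)
print(serialized)
```

On such a version, `j` should still come before `h` in both the parsed names and the serialized string.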
answered Sep 16 '22 16:09 by SnellyBigoda



Nope. ElementTree uses a dictionary to store attribute values, so it's inherently unordered.

Even DOM doesn't guarantee you attribute ordering, and DOM exposes a lot more detail of the XML infoset than ElementTree does. (There are some DOMs that do offer it as a feature, but it's not standard.)

Can it be fixed? Maybe. Here's a stab at it that replaces the dictionary when parsing with an ordered one (collections.OrderedDict()).

    from xml.etree import ElementTree
    from collections import OrderedDict
    import StringIO

    class OrderedXMLTreeBuilder(ElementTree.XMLTreeBuilder):
        def _start_list(self, tag, attrib_in):
            fixname = self._fixname
            tag = fixname(tag)
            attrib = OrderedDict()
            if attrib_in:
                for i in range(0, len(attrib_in), 2):
                    attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
            return self._target.start(tag, attrib)

    >>> xmlf = StringIO.StringIO('<a b="c" d="e" f="g" j="k" h="i"/>')
    >>> tree = ElementTree.ElementTree()
    >>> root = tree.parse(xmlf, OrderedXMLTreeBuilder())
    >>> root.attrib
    OrderedDict([('b', 'c'), ('d', 'e'), ('f', 'g'), ('j', 'k'), ('h', 'i')])

Looks potentially promising.

    >>> s = StringIO.StringIO()
    >>> tree.write(s)
    >>> s.getvalue()
    '<a b="c" d="e" f="g" h="i" j="k" />'

Bah, the serialiser outputs them in canonical order.

This looks like the line to blame, in ElementTree._write:

            items.sort() # lexical order 
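To see why that one line undoes the ordered dictionary, here is a small standalone illustration. It does not touch ElementTree at all; it only mimics what the serializer does with the list of attribute pairs, using the same attributes as the example above.

```python
from collections import OrderedDict

# Attributes in document order, as the ordered builder stored them.
attrib = OrderedDict(
    [("b", "c"), ("d", "e"), ("f", "g"), ("j", "k"), ("h", "i")]
)

# Copy the (name, value) pairs into a plain list, then sort it in place.
items = list(attrib.items())
items.sort()  # lexical order by name: document order is lost here

sorted_names = [name for name, _ in items]
print(sorted_names)
```

The dictionary itself still remembers document order; only the copied list is reordered, which is why `h` lands before `j` in the written output.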

Subclassing or monkey-patching that is going to be annoying as it's right in the middle of a big method.

Unless you did something nasty like subclass OrderedDict and hack items to return a special subclass of list that ignores calls to sort(). Nah, probably that's even worse and I should go to bed before I come up with anything more horrible than that.
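For the curious, a rough sketch of what that nasty idea could look like. The class names `UnsortableList` and `UnsortableOrderedDict` are invented here. Note the limitation: it only defeats an in-place `items.sort()`; a serializer that calls `sorted(items)` builds a fresh plain list and sorts that, so the trick does nothing there.

```python
from collections import OrderedDict


class UnsortableList(list):
    """A list whose in-place sort() is a no-op (hypothetical helper)."""

    def sort(self, *args, **kwargs):
        pass  # deliberately ignore the request to sort


class UnsortableOrderedDict(OrderedDict):
    """An OrderedDict whose items() returns an UnsortableList."""

    def items(self):
        return UnsortableList(OrderedDict.items(self))


attrib = UnsortableOrderedDict([("j", "k"), ("h", "i")])

items = attrib.items()
items.sort()  # ignored, so document order survives
in_place = [name for name, _ in items]

# sorted() copies into an ordinary list, so the override is bypassed.
copied = [name for name, _ in sorted(items)]

print(in_place, copied)
```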

answered Sep 16 '22 16:09 by bobince
