I have the following function which does a crude job of parsing an XML file into a dictionary.
Unfortunately, since Python dictionaries are not ordered, I am unable to cycle through the nodes as I would like.
How do I change this so it outputs an ordered dictionary which reflects the original order of the nodes when looped with for
.
def simplexml_load_file(file):
import collections
from lxml import etree
tree = etree.parse(file)
root = tree.getroot()
def xml_to_item(el):
item = None
if el.text:
item = el.text
child_dicts = collections.defaultdict(list)
for child in el.getchildren():
child_dicts[child.tag].append(xml_to_item(child))
return dict(child_dicts) or item
def xml_to_dict(el):
return {el.tag: xml_to_item(el)}
return xml_to_dict(root)
x = simplexml_load_file('routines/test.xml')
print x
for y in x['root']:
print y
Outputs:
{'root': {
'a': ['1'],
'aa': [{'b': [{'c': ['2']}, '2']}],
'aaaa': [{'bb': ['4']}],
'aaa': ['3'],
'aaaaa': ['5']
}}
a
aa
aaaa
aaa
aaaaa
How can I implement collections.OrderedDict
so that I can be sure of getting the correct order of the nodes?
XML file for reference:
<root>
<a>1</a>
<aa>
<b>
<c>2</c>
</b>
<b>2</b>
</aa>
<aaa>3</aaa>
<aaaa>
<bb>4</bb>
</aaaa>
<aaaaa>5</aaaaa>
</root>
This means, when iterating over an OrderedDict, the elements are returned in the same order in which the keys (of the key-value pair) were inserted into the data structure. Under the hood, OrderedDict is implemented with the help of the doubly-linked list data structure.
Python OrderedDict is a dict subclass that maintains the items insertion order. When we iterate over an OrderedDict, items are returned in the order they were inserted. A regular dictionary doesn't track the insertion order. So when iterating over it, items are returned in an arbitrary order.
The OrderedDict is a subclass of dict object in Python. The only difference between OrderedDict and dict is that, in OrderedDict, it maintains the orders of keys as inserted. In the dict, the ordering may or may not be happen. The OrderedDict is a standard library class, which is located in the collections module.
You could use the new OrderedDict
dict
subclass which was added to the standard library's collections
module in version 2.7✶. Actually what you need is an Ordered
+defaultdict
combination which doesn't exist — but it's possible to create one by subclassing OrderedDict
as illustrated below:
✶ If your version of Python doesn't have OrderedDict
, you should be able use Raymond Hettinger's Ordered Dictionary for Py2.4 ActiveState recipe as the base class instead.
import collections
class OrderedDefaultdict(collections.OrderedDict):
""" A defaultdict with OrderedDict as its base class. """
def __init__(self, default_factory=None, *args, **kwargs):
if not (default_factory is None or callable(default_factory)):
raise TypeError('first argument must be callable or None')
super(OrderedDefaultdict, self).__init__(*args, **kwargs)
self.default_factory = default_factory # called by __missing__()
def __missing__(self, key):
if self.default_factory is None:
raise KeyError(key,)
self[key] = value = self.default_factory()
return value
def __reduce__(self): # Optional, for pickle support.
args = (self.default_factory,) if self.default_factory else tuple()
return self.__class__, args, None, None, iter(self.items())
def __repr__(self): # Optional.
return '%s(%r, %r)' % (self.__class__.__name__, self.default_factory, self.items())
def simplexml_load_file(file):
from lxml import etree
tree = etree.parse(file)
root = tree.getroot()
def xml_to_item(el):
item = el.text or None
child_dicts = OrderedDefaultdict(list)
for child in el.getchildren():
child_dicts[child.tag].append(xml_to_item(child))
return collections.OrderedDict(child_dicts) or item
def xml_to_dict(el):
return {el.tag: xml_to_item(el)}
return xml_to_dict(root)
x = simplexml_load_file('routines/test.xml')
print(x)
for y in x['root']:
print(y)
The output produced from your test XML file looks like this:
{'root':
OrderedDict(
[('a', ['1']),
('aa', [OrderedDict([('b', [OrderedDict([('c', ['2'])]), '2'])])]),
('aaa', ['3']),
('aaaa', [OrderedDict([('bb', ['4'])])]),
('aaaaa', ['5'])
]
)
}
a
aa
aaa
aaaa
aaaaa
Which I think is close to what you want.
Minor update:
Added a __reduce__()
method which will allow the instances of the class to be pickled and unpickled properly. This wasn't necessary for this question, but came up in a similar one.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With