Suppose I have the following XML document:
<species>
Mammals: <dog/> <cat/>
Reptiles: <snake/> <turtle/>
Birds: <seagull/> <owl/>
</species>
Then I get the species element like this:
import lxml.etree
doc = lxml.etree.fromstring(xml)
species = doc.xpath('/species')[0]
Now I would like to print a list of animals grouped by species. How can I do this using the ElementTree API?
If you enumerate all of the nodes, you'll see a text node with the class followed by element nodes with the species:
>>> for node in species.xpath("child::node()"):
...     print type(node), node
...
<class 'lxml.etree._ElementStringResult'>
Mammals:
<type 'lxml.etree._Element'> <Element dog at 0xe0b3c0>
<class 'lxml.etree._ElementStringResult'>
<type 'lxml.etree._Element'> <Element cat at 0xe0b410>
<class 'lxml.etree._ElementStringResult'>
Reptiles:
<type 'lxml.etree._Element'> <Element snake at 0xe0b460>
<class 'lxml.etree._ElementStringResult'>
<type 'lxml.etree._Element'> <Element turtle at 0xe0b4b0>
<class 'lxml.etree._ElementStringResult'>
Birds:
<type 'lxml.etree._Element'> <Element seagull at 0xe0b500>
<class 'lxml.etree._ElementStringResult'>
<type 'lxml.etree._Element'> <Element owl at 0xe0b550>
<class 'lxml.etree._ElementStringResult'>
So you can build it from there:
my_species = {}
current_class = None
for node in species.xpath("child::node()"):
    if isinstance(node, lxml.etree._ElementStringResult):
        text = node.strip(' \n\t:')
        if text:
            current_class = my_species.setdefault(text, [])
    elif isinstance(node, lxml.etree._Element):
        if current_class is not None:
            current_class.append(node.tag)
print my_species
results in
{'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}
This is all fragile... small changes in how the text nodes are arranged can mess up the parsing.
The answer by @tdelaney is basically right, but I want to point out one nuance of the Python ElementTree API. Here's a quote from the lxml tutorial:
Elements can contain text:
<root>TEXT</root>
In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.
However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.
The two properties text and tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
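To make the quoted description concrete, here is a minimal sketch of text and tail on the tutorial's own example. It uses the standard library's xml.etree.ElementTree instead of lxml (my assumption, so that it runs without extra packages); the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

# .text is the text between an element's start tag and its first child.
body_text = body.text
# .tail is the text that directly follows an element's end tag.
br_tail = br.tail

print(repr(body_text))  # 'Hello'
print(repr(br_tail))    # 'World'
print(repr(br.text))    # None, because <br/> itself is empty
```

Note that "World" is stored on the br element rather than on body, which is why there is no separate text node to iterate over.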
Taking these properties into account it is possible to retrieve document text without forcing the tree to output text nodes.
#!/usr/bin/env python3
import itertools
from pprint import pprint

try:
    from lxml import etree
except ImportError:
    # cElementTree was removed in Python 3.9; ElementTree is the drop-in fallback
    from xml.etree import ElementTree as etree


def textAndElement(node):
    '''In py33+ recursive generators are easy'''
    yield node
    # text between the start tag and the first child
    text = node.text.strip() if node.text else None
    if text:
        yield text
    for child in node:
        yield from textAndElement(child)
    # text that follows the end tag, up to the next element
    tail = node.tail.strip() if node.tail else None
    if tail:
        yield tail


if __name__ == '__main__':
    xml = '''
<species>
Mammals: <dog/> <cat/>
Reptiles: <snake/> <turtle/>
Birds: <seagull/> <owl/>
</species>
'''
    doc = etree.fromstring(xml)
    pprint(list(textAndElement(doc)))
    #[<Element species at 0x7f2c538727d0>,
    #'Mammals:',
    #<Element dog at 0x7f2c538728c0>,
    #<Element cat at 0x7f2c53872910>,
    #'Reptiles:',
    #<Element snake at 0x7f2c53872960>,
    #<Element turtle at 0x7f2c538729b0>,
    #'Birds:',
    #<Element seagull at 0x7f2c53872a00>,
    #<Element owl at 0x7f2c53872a50>]

    gen = textAndElement(doc)
    next(gen)  # skip root
    groups = []
    # consecutive items of the same type (str or Element) form one group
    for _, g in itertools.groupby(gen, type):
        groups.append(tuple(g))
    # pair each text group with the element group that follows it
    pprint(dict(zip(*[iter(groups)] * 2)))
    #{('Birds:',): (<Element seagull at 0x7fc37f38aaa0>,
    #               <Element owl at 0x7fc37f38a820>),
    #('Mammals:',): (<Element dog at 0x7fc37f38a960>,
    #                <Element cat at 0x7fc37f38a9b0>),
    #('Reptiles:',): (<Element snake at 0x7fc37f38aa00>,
    #                 <Element turtle at 0x7fc37f38aa50>)}
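If the goal is a plain dict of tag names like the one in the first answer, the same text/tail idea can be applied directly to the children of species. The following is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree, the helper name group_by_label is made up for illustration, and it assumes every label appears as text before the elements it describes.

```python
import xml.etree.ElementTree as ET


def group_by_label(parent):
    """Map each text label under parent to the tags of the elements after it."""
    groups = {}
    current = None

    # The first label, if any, is the parent's own text.
    label = (parent.text or "").strip(" \n\t:")
    if label:
        current = groups.setdefault(label, [])

    for child in parent:
        if current is not None:
            current.append(child.tag)
        # Any later label is stored as the tail of the preceding child.
        label = (child.tail or "").strip(" \n\t:")
        if label:
            current = groups.setdefault(label, [])
    return groups


xml = """<species>
Mammals: <dog/> <cat/>
Reptiles: <snake/> <turtle/>
Birds: <seagull/> <owl/>
</species>"""

result = group_by_label(ET.fromstring(xml))
print(result)
# {'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}
```

This has the same fragility as the XPath version: it depends on how the text is arranged around the elements, so elements that appear before any label are silently skipped.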