I have to parse large XML documents (up to 1 GB) with Python. I am using the iterparse() function (SAX-style parsing).
My concern is the following. Imagine you have an XML document like this:
<?xml version="1.0" encoding="UTF-8" ?>
<families>
    <family>
        <name>Simpson</name>
        <members>
            <name>Homer</name>
            <name>Marge</name>
            <name>Bart</name>
        </members>
    </family>
    <family>
        <name>Griffin</name>
        <members>
            <name>Peter</name>
            <name>Brian</name>
            <name>Meg</name>
        </members>
    </family>
</families>
The problem is, of course, knowing when I am getting a family name (such as Simpson) and when I am getting the name of one of that family's members (for example Homer).
What I have been doing so far is to use "switches" that tell me whether I am inside a "members" tag or not. The code looks like this:
import xml.etree.cElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
context = ET.iterparse(file_path, events=("start", "end"))

# turn it into an iterator
context = iter(context)
on_members_tag = False
for event, elem in context:
    tag = elem.tag
    value = elem.text
    if value:
        value = value.encode('utf-8').strip()

    if event == 'start':
        if tag == "members":
            on_members_tag = True
        elif tag == 'name':
            if on_members_tag:
                print "The member of the family is %s" % value
            else:
                print "The family is %s " % value

    if event == 'end' and tag == 'members':
        on_members_tag = False
    elem.clear()
This works fine, as the output is:
The family is Simpson
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg
My concern is that even with this simple example I had to create an extra variable (on_members_tag) to know which tag I was in. The real XML documents I have to handle have many more nested tags.
Also note that this is a very reduced example, so you can assume that I may be facing XML with more tags, more inner tags, and a need to read different tag names, attributes and so on.
So the question is: am I doing something horribly stupid here? I feel like there must be a more elegant solution to this.
Here's one possible approach: we maintain a path list and peek backwards to find the parent node(s).
import xml.etree.cElementTree as ET  # same import as in the question

# file_path is defined as in the question's code
path = []
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        if elem.tag == 'name':
            if 'members' in path:
                print('member')
            else:
                print('nonmember')
        path.pop()
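The path-list idea above can be sketched in Python 3 roughly as follows. This is only an illustration under a few assumptions: the names classify_names and SAMPLE are made up for the example, the sample document is the one from the question, and only the immediate parent (path[-2]) is checked rather than any ancestor. The text is read on "end" events because iterparse() does not guarantee that elem.text is populated when the "start" event fires.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from the question (SAMPLE is a name made up for this sketch).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Marge</name>
      <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
      <name>Peter</name>
      <name>Brian</name>
      <name>Meg</name>
    </members>
  </family>
</families>
"""


def classify_names(source):
    """Yield ('family', text) or ('member', text) for every <name> element."""
    path = []  # stack of the tags that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # "end" event: the element, including its text, is complete here.
        if elem.tag == "name":
            parent = path[-2] if len(path) > 1 else None
            text = (elem.text or "").strip()
            if parent == "members":
                yield ("member", text)
            elif parent == "family":
                yield ("family", text)
        elif elem.tag == "family":
            # Free the finished subtree; the emptied <family> element itself
            # stays attached to the root.
            elem.clear()
        path.pop()


for kind, text in classify_names(io.BytesIO(SAMPLE)):
    print(kind, text)
```

The same stack works for deeper nesting: instead of one boolean per tag, you compare the tail of path against whatever parent chain you care about.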
pulldom is excellent for this. You get a SAX stream. You can iterate through the stream, and when you find a node that you are interested in, load that node into a DOM fragment.
import xml.dom.pulldom as pulldom
import xpath # from http://code.google.com/p/py-dom-xpath/

events = pulldom.parse('families.xml')
for event, node in events:
    if event == 'START_ELEMENT' and node.tagName=='family':
        events.expandNode(node) # node now contains a dom fragment
        family_name = xpath.findvalue('name', node)
        members = xpath.findvalues('members/name', node)
        print('family name: {0}, members: {1}'.format(family_name, members))
Output:
family name: Simpson, members: [u'Homer', u'Marge', u'Bart']
family name: Griffin, members: [u'Peter', u'Brian', u'Meg']
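If you would rather not install the third-party xpath module, the same expandNode() idea can probably be done with the standard library alone. The following Python 3 sketch assumes the document layout from the question; text_of, read_families and SAMPLE are hypothetical names introduced only for this example.

```python
import io
from xml.dom import pulldom

# Sample document from the question (SAMPLE is a name made up for this sketch).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Marge</name>
      <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
      <name>Peter</name>
      <name>Brian</name>
      <name>Meg</name>
    </members>
  </family>
</families>
"""


def text_of(node):
    """Join the text nodes that sit directly under a DOM element."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def read_families(source):
    """Yield (family_name, [member names]) for each <family> element."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "family":
            events.expandNode(node)  # node now holds the whole <family> subtree
            family_name = None
            members = []
            # Only look at direct children, so the family's own <name> is not
            # confused with the <name> elements nested under <members>.
            for child in node.childNodes:
                if child.nodeType != child.ELEMENT_NODE:
                    continue
                if child.tagName == "name":
                    family_name = text_of(child)
                elif child.tagName == "members":
                    members = [
                        text_of(m) for m in child.getElementsByTagName("name")
                    ]
            yield family_name, members


for family_name, members in read_families(io.BytesIO(SAMPLE)):
    print("family name: {0}, members: {1}".format(family_name, members))
```

Only one family subtree is expanded into a DOM fragment at a time, so memory use should stay tied to the size of the largest family rather than the whole file.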