ElementTree iterparse strategy

I have to handle large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).

My concern is the following. Imagine you have an XML document like this:

<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>

The problem, of course, is knowing when I am getting a family name (such as Simpson) and when I am getting the name of a member of that family (for example, Homer).

What I have been doing so far is to use "switches" that tell me whether I am inside a "members" tag or not. The code looks like this:

import xml.etree.ElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
context = ET.iterparse(file_path, events=("start", "end"))

on_members_tag = False
for event, elem in context:
    tag = elem.tag

    if event == 'start':
        # Only track position here; elem.text is not guaranteed to be set yet.
        if tag == "members":
            on_members_tag = True

    elif event == 'end':
        if tag == 'members':
            on_members_tag = False
        elif tag == 'name':
            value = (elem.text or '').strip()
            if on_members_tag:
                print("The member of the family is %s" % value)
            else:
                print("The family is %s " % value)
        # Free the element only once it has been fully parsed.
        elem.clear()

This works fine; the output is:

The family is Simpson 
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin 
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg

My concern is that even with this simple example I had to create an extra variable (on_members_tag) to know which tag I was in. The real XML documents I have to handle have many more nested tags.

Also note that this is a very reduced example, so you can assume that I may be facing XML with more tags, more inner tags, and a need to extract different tag names, attributes and so on.

So the question is: am I doing something horribly stupid here? I feel like there must be a more elegant solution to this.

asked Oct 09 '12 04:10

Juan Antonio Gomez Moriano

2 Answers

Here's one possible approach: we maintain a path list and peek backwards to find the parent node(s).

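import xml.etree.ElementTree as ET

file_path = "test.xml"  # same file as in the question

path = []  # tags of the currently open elements, outermost first
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        if elem.tag == 'name':
            if 'members' in path:
                print('member')
            else:
                print('nonmember')
        path.pop()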
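
With deeper nesting, the same path list can be compared against full paths instead of being tested for a single ancestor. The following is only a minimal sketch and not part of the original answer: the HANDLERS table, the parse_families helper and the inline sample document are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the question's document.
SAMPLE = b"""<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Marge</name>
    </members>
  </family>
</families>"""

# Hypothetical lookup table: full path from the root -> label for that text.
HANDLERS = {
    ("families", "family", "name"): "family",
    ("families", "family", "members", "name"): "member",
}


def parse_families(source):
    """Return (label, text) pairs for every element whose path is in HANDLERS."""
    results = []
    path = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            label = HANDLERS.get(tuple(path))
            if label is not None:
                # elem.text is complete once the 'end' event has fired.
                results.append((label, (elem.text or "").strip()))
            path.pop()
            elem.clear()  # release the element's content once handled
    return results


print(parse_families(io.BytesIO(SAMPLE)))
```

Adding another kind of element then means adding one entry to the table rather than one more boolean flag.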

answered Oct 03 '22 22:10

nneonneo

pulldom is excellent for this. You get a SAX stream. You can iterate through the stream and, when you find a node that you are interested in, load that node into a DOM fragment.

import xml.dom.pulldom as pulldom
import xpath # from http://code.google.com/p/py-dom-xpath/

events = pulldom.parse('families.xml')
for event, node in events:
    if event == 'START_ELEMENT' and node.tagName=='family':
        events.expandNode(node) # node now contains a dom fragment
        family_name = xpath.findvalue('name', node)
        members = xpath.findvalues('members/name', node)
        print('family name: {0}, members: {1}'.format(family_name, members))

output:

family name: Simpson, members: [u'Homer', u'Marge', u'Bart']
family name: Griffin, members: [u'Peter', u'Brian', u'Meg']
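
The same idea of handling one complete record at a time can also be sketched with iterparse alone: wait for the "end" event of each family element, query its finished subtree, then clear it. This uses only the standard library and is not part of the original answer; the iter_families helper and the inline sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the question's document.
SAMPLE = b"""<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Marge</name>
      <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
      <name>Peter</name>
      <name>Brian</name>
      <name>Meg</name>
    </members>
  </family>
</families>"""


def iter_families(source):
    """Yield (family_name, [member_names]) one family at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "family":
            continue
        # The whole <family> subtree is available at its 'end' event,
        # so relative paths tell the two kinds of <name> apart.
        family_name = elem.findtext("name")
        members = [m.text for m in elem.findall("members/name")]
        yield family_name, members
        elem.clear()  # drop the subtree's content before moving on


for family_name, members in iter_families(io.BytesIO(SAMPLE)):
    print("family name: {0}, members: {1}".format(family_name, members))
```

Cleared elements stay attached to the root as empty shells, so for very large files it may also be worth removing them from their parent.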

answered Sep 29 '22 22:09

Gary van der Merwe

Tags: python, xml, elementtree, sax, iterparse
