XML parsing - ElementTree vs SAX and DOM

2 Answers

ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of lists, and attributes are represented as dictionaries.
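As a rough illustration of the list-and-dictionary model described above, here is a minimal sketch. The sample document and the variable names are made up for this example and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
root = ET.fromstring('<catalog><book id="1" lang="en"/><book id="2"/></catalog>')

# An element behaves like a list of its child elements:
# it supports len(), indexing and iteration.
number_of_books = len(root)
first_book = root[0]
book_ids = [book.get("id") for book in root]

# The attributes of an element live in a plain dictionary.
first_attributes = dict(first_book.attrib)

print(number_of_books)
print(book_ids)
print(first_attributes)
```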

ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.
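The discard-as-you-go approach could look roughly like the sketch below. It assumes the document is a flat list of book elements under one root. The in-memory data is a hypothetical stand-in for a large file, and clearing the root after each record is one common idiom rather than the only option.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for a large file on disk.
xml_data = io.BytesIO(
    b"<catalog>"
    b"<book><title>T1</title></book>"
    b"<book><title>T2</title></book>"
    b"<book><title>T3</title></book>"
    b"</catalog>"
)

context = ET.iterparse(xml_data, events=("start", "end"))
# The first event is the start of the root element; keep a reference to it.
event, root = next(context)

titles = []
for event, elem in context:
    if event == "end" and elem.tag == "book":
        # The <book> subtree is complete at this point, so process it ...
        titles.append(elem.findtext("title"))
        # ... and then drop it from the root so the tree does not keep growing.
        root.clear()

print(titles)
```

Without the call to root.clear(), iterparse would still end up building the complete tree in memory, just incrementally.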

ElementTree, as shipped with Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, lxml used to be quite unstable, but I haven't had any problems with it since version 2.1.
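To give an idea of what the limited path support covers, here is a small sketch using the ElementTree shipped with current Python versions (older releases such as the one in Python 2.5 supported less). The data mirrors the sample catalog used elsewhere on this page. Anything beyond this subset, such as XPath functions or axes, is where lxml comes in.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catalog>"
    '<book isdn="xxx-1"><author>A1</author><title>T1</title></book>'
    '<book isdn="xxx-2"><author>A2</author><title>T2</title></book>'
    "</catalog>"
)

# Supported: descendant search combined with an attribute predicate.
match = root.find(".//book[@isdn='xxx-2']")
title = match.findtext("title")

# Supported: predicate on the text of a child element.
by_author = root.findall("./book[author='A1']")
isdns = [book.get("isdn") for book in by_author]

print(title)
print(isdns)
```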

ElementTree deviates from DOM in that elements only know their children, whereas DOM nodes also have access to their parent and siblings. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet

<a>This is <b>a</b> test</a>

the string " test" will be the so-called tail of element b.
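A short sketch of how this looks in practice, using the snippet above. The parent map at the end is one common workaround for the missing parent links and is only an illustration; the name parent_of is made up here.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>This is <b>a</b> test</a>")
b = a.find("b")

# Text before the first child belongs to the parent element.
text_of_a = a.text
# Text inside <b> belongs to b itself.
text_of_b = b.text
# Text after the closing </b> is stored on b as its tail, not on a.
tail_of_b = b.tail

# itertext() walks text and tail values in document order.
full_text = "".join(a.itertext())

# Elements have no parent reference, so build a child -> parent map if needed.
parent_of = {child: parent for parent in a.iter() for child in parent}

print(repr(text_of_a), repr(text_of_b), repr(tail_of_b))
print(full_text)
print(parent_of[b] is a)
```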

In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.
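For comparison, a SAX-based version of a simple extraction task might look like the following sketch. The handler class TitleHandler and the sample data are hypothetical. The point is only to show the callback style and the manual state tracking that SAX requires.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element (hypothetical example)."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.titles = []
        self._buffer = None  # list of text chunks while inside <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times for one text node.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


handler = TitleHandler()
xml.sax.parseString(
    b"<catalog>"
    b"<book><title>T1</title></book>"
    b"<book><title>T2</title></book>"
    b"</catalog>",
    handler,
)
print(handler.titles)
```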


answered Oct 09 '22 20:10

Torsten Marek

Minimal DOM implementation:


Python's xml.dom package defines the W3C-standard XML DOM API, and xml.dom.minidom is a minimal implementation of it that is simpler and smaller than a full DOM implementation. However, from a "parsing perspective", it has all the pros and cons of the standard DOM, i.e. it loads everything into memory.

Considering a basic XML file:

<?xml version="1.0"?>
<catalog>
    <book isdn="xxx-1">
      <author>A1</author>
      <title>T1</title>
    </book>
    <book isdn="xxx-2">
      <author>A2</author>
      <title>T2</title>
    </book>
</catalog>

A possible Python parser using minidom is:

import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError

#-------- Select the XML file: --------#
# Current file name and directory:
curpath = os.path.dirname(os.path.realpath(__file__))
filename = os.path.join(curpath, "sample.xml")
#print("Filename: %s" % filename)

#-------- Parse the XML file: --------#
try:
    # Parse the given XML file:
    xmldoc = minidom.parse(filename)
except ExpatError as e:
    # Malformed XML: report the position and the Expat error code.
    print("[XML] Error (line %d): %d" % (e.lineno, e.code))
    print("[XML] Offset: %d" % e.offset)
    raise
except IOError as e:
    print("[IO] I/O Error %d: %s" % (e.errno, e.strerror))
    raise
else:
    catalog = xmldoc.documentElement
    books = catalog.getElementsByTagName("book")

    for book in books:
        print(book.getAttribute('isdn'))
        print(book.getElementsByTagName('author')[0].firstChild.data)
        print(book.getElementsByTagName('title')[0].firstChild.data)

Note that xml.parsers.expat is a Python interface to the Expat non-validating XML parser (docs.python.org/2/library/pyexpat.html).

The xml.dom package also supplies the exception class DOMException, but it is not supported in minidom!

The ElementTree XML API:


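ElementTree is much easier to use and requires less memory than XML DOM. Furthermore, a C implementation is available (xml.etree.cElementTree). Note that since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically whenever it is available, and cElementTree was removed in Python 3.9.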

A possible Python parser using ElementTree is:

import os
# On Python 3.3+ this module uses the C accelerator automatically;
# the separate cElementTree module was removed in Python 3.9.
from xml.etree import ElementTree

#-------- Select the XML file: --------#
# Current file name and directory:
curpath = os.path.dirname(os.path.realpath(__file__))
filename = os.path.join(curpath, "sample.xml")
#print("Filename: %s" % filename)

#-------- Parse the XML file: --------#
try:
    # Parse the given XML file:
    tree = ElementTree.parse(filename)
except ElementTree.ParseError as e:
    # XML formatting errors. Since Python 2.7, ElementTree raises ParseError
    # here rather than ExpatError; e.position is a (line, column) tuple.
    print("[XML] Error (line %d): %d" % (e.position[0], e.code))
    print("[XML] Offset: %d" % e.position[1])
    raise
except IOError as e:
    print("[IO] I/O Error %d: %s" % (e.errno, e.strerror))
    raise
else:
    catalogue = tree.getroot()

    for book in catalogue:
        print(book.attrib.get("isdn"))
        print(book.find('author').text)
        print(book.find('title').text)

answered Oct 09 '22 20:10

Paolo Rovelli


Tags:

python

dom

xml

elementtree

sax

Corey Goldberg
