What is the fastest way to parse large XML docs in Python?

I am currently running the following code based on Chapter 12.5 of the Python Cookbook:

    from xml.parsers import expat

    class Element(object):
        def __init__(self, name, attributes):
            self.name = name
            self.attributes = attributes
            self.cdata = ''
            self.children = []

        def addChild(self, element):
            self.children.append(element)

        def getAttribute(self, key):
            return self.attributes.get(key)

        def getData(self):
            return self.cdata

        def getElements(self, name=''):
            if name:
                return [c for c in self.children if c.name == name]
            else:
                return list(self.children)

    class Xml2Obj(object):
        def __init__(self):
            self.root = None
            self.nodeStack = []

        def StartElement(self, name, attributes):
            element = Element(name.encode(), attributes)
            if self.nodeStack:
                parent = self.nodeStack[-1]
                parent.addChild(element)
            else:
                self.root = element
            self.nodeStack.append(element)

        def EndElement(self, name):
            self.nodeStack.pop()

        def CharacterData(self, data):
            if data.strip():
                data = data.encode()
                element = self.nodeStack[-1]
                element.cdata += data

        def Parse(self, filename):
            Parser = expat.ParserCreate()
            Parser.StartElementHandler = self.StartElement
            Parser.EndElementHandler = self.EndElement
            Parser.CharacterDataHandler = self.CharacterData
            ParserStatus = Parser.Parse(open(filename).read(), 1)
            return self.root

I am working with XML documents of about 1 GB in size. Does anyone know a faster way to parse these?

asked Nov 27 '08 16:11

Jeroen Dirks

1 Answer

It looks to me as if you do not need any DOM capabilities from your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur.

Note, however, Fredrik's advice on using cElementTree's iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

    for event, elem in iterparse(source):
        if elem.tag == "record":
            ... process record elements ...
            elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

    # get an iterable
    context = iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = context.next()

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            ... process record elements ...
            root.clear()

Note that lxml.etree.iterparse() does not allow this.

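The previous snippet does not work on Python 3.7, because Python 3 iterators have no .next() method (the built-in next() function replaces it). Consider the following way to get the first element instead: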

    import xml.etree.ElementTree as ET

    # Get an iterable.
    context = ET.iterparse(source, events=("start", "end"))

    for index, (event, elem) in enumerate(context):
        # Get the root element.
        if index == 0:
            root = elem
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()
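For anyone who wants to try the root-clearing pattern without a 1 GB file at hand, here is a minimal, self-contained sketch for Python 3. The function name count_records, the in-memory sample document and the "record" tag are my own assumptions for illustration; the counter merely stands in for whatever per-record processing you actually need.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Stream through `source`, counting <tag> elements and freeing them as we go.

    Returns (number of records seen, number of children still attached to the root).
    """
    # iter() is harmless if the object is already an iterator.
    context = iter(ET.iterparse(source, events=("start", "end")))

    # The very first event is the "start" of the root element.
    _, root = next(context)

    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1      # placeholder for real per-record processing
            root.clear()    # detach already-processed children from the root
    return count, len(root)


# Hypothetical sample document: 1000 small <record> elements under one root.
xml_text = "<data>" + "".join(
    "<record id='%d'>x</record>" % i for i in range(1000)
) + "</data>"
sample = io.BytesIO(xml_text.encode("utf-8"))

print(count_records(sample))  # (1000, 0)
```

The second number in the result shows that the root is not left holding a long list of emptied children, which is the point of calling root.clear() rather than elem.clear(). Keep in mind that root.clear() also wipes the root's own attributes and text, so read those before the loop if you need them.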

answered Sep 30 '22 14:09

Steen

Tags: performance, python, parsing, xml
