What is the fastest way to parse large XML docs in Python?

I am currently running the following code based on Chapter 12.5 of the Python Cookbook:

    from xml.parsers import expat

    class Element(object):
        def __init__(self, name, attributes):
            self.name = name
            self.attributes = attributes
            self.cdata = ''
            self.children = []

        def addChild(self, element):
            self.children.append(element)

        def getAttribute(self, key):
            return self.attributes.get(key)

        def getData(self):
            return self.cdata

        def getElements(self, name=''):
            if name:
                return [c for c in self.children if c.name == name]
            else:
                return list(self.children)

    class Xml2Obj(object):
        def __init__(self):
            self.root = None
            self.nodeStack = []

        def StartElement(self, name, attributes):
            element = Element(name.encode(), attributes)
            if self.nodeStack:
                parent = self.nodeStack[-1]
                parent.addChild(element)
            else:
                self.root = element
            self.nodeStack.append(element)

        def EndElement(self, name):
            self.nodeStack.pop()

        def CharacterData(self, data):
            if data.strip():
                data = data.encode()
                element = self.nodeStack[-1]
                element.cdata += data

        def Parse(self, filename):
            Parser = expat.ParserCreate()
            Parser.StartElementHandler = self.StartElement
            Parser.EndElementHandler = self.EndElement
            Parser.CharacterDataHandler = self.CharacterData
            ParserStatus = Parser.Parse(open(filename).read(), 1)
            return self.root

I am working with XML documents of about 1 GB in size. Does anyone know a faster way to parse these?

Jeroen Dirks asked Nov 27 '08 16:11



1 Answer

It looks to me as if you do not need any DOM capabilities in your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur.

Note, however, Fredrik's advice on using the cElementTree iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

    for event, elem in iterparse(source):
        if elem.tag == "record":
            # ... process record elements ...
            elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

    # get an iterable
    context = iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = context.next()

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()

lxml's iterparse() does not allow this.

The previous snippet does not work on Python 3.7, where iterators have no .next() method. Consider the following way to get the first element instead:

    import xml.etree.ElementTree as ET

    # Get an iterable.
    context = ET.iterparse(source, events=("start", "end"))

    for index, (event, elem) in enumerate(context):
        # Get the root element.
        if index == 0:
            root = elem
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()
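For anyone who wants to try the pattern end to end, here is a minimal self-contained sketch for Python 3. It takes the root from the first event with the built-in next() and wraps the loop in a generator. The helper name iter_records, the "record" tag default, and the tiny in-memory document are assumptions made up for illustration; they are not part of the question or of Fredrik's notes.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield the text of each <tag> element without keeping the whole tree."""
    # Request "start" events too, so the first event hands us the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # built-in next() replaces Python 2's context.next()
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            # Drop everything collected under the root so far,
            # so processed records do not pile up in memory.
            root.clear()


# Hypothetical sample data standing in for a large file on disk.
sample = io.BytesIO(b"<root><record>a</record><record>b</record></root>")
records = list(iter_records(sample))
```

In real use you would pass a filename or an open binary file instead of the BytesIO object, and do the per-record work inside the loop rather than collecting the results in a list.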
Steen answered Sep 30 '22 14:09
