Using Python Iterparse For Large XML Files

Tags: python, xml, large-files, lxml, elementtree

I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I want to use iterparse in lxml to do it.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    # The child element in the sample file is <desc>, not <description>
    print(elem.xpath('desc/text()'))

del context

Unfortunately, this solution still eats up a lot of memory. I think the problem is that after dealing with each "ITEM" I need to do something to clean up empty children. Can anyone suggest what I might do after processing my data to clean up properly?

asked Aug 24 '11 06:08

Dave Johnshon

2 Answers

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    # The child element in the question's sample file is <desc>
    print(elem.xpath('desc/text()'))


context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.

Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.

import io
import textwrap

import lxml.etree as ET


def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    # io.BytesIO needs bytes on Python 3, so encode the text
    return content.encode('utf-8')


def show(elem):
    # Serialize as text, without the whitespace that trails the element
    return ET.tostring(elem, encoding='unicode', with_tail=False)


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=show(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=show(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=show(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=show(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    # The improved fast_iter deletes A1. The original fast_iter does not.
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2


study_fast_iter()
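A rough way to check that clearing really lowers memory use is to compare peak allocations with and without the cleanup step. The sketch below is only an illustration under a few assumptions: it swaps in the standard library's xml.etree.ElementTree instead of lxml (tracemalloc only sees allocations made through Python's allocator, not those made inside libxml2), it generates its own sample data, and the names build_sample and peak_bytes are made up for this example. It also assumes every item is a direct child of a single root element.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET


def build_sample(n_items):
    # Hypothetical test data shaped like the question's file, wrapped in a root
    items = "".join(
        "<item><title>Item {0}</title><desc>Description {0}</desc></item>".format(i)
        for i in range(n_items)
    )
    return ("<items>" + items + "</items>").encode("utf-8")


def peak_bytes(data, clear):
    """Parse data item by item; return (item count, peak traced bytes)."""
    tracemalloc.start()
    context = iter(ET.iterparse(io.BytesIO(data), events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            count += 1
            if clear:
                root.clear()  # detach everything attached to the root so far
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, peak


data = build_sample(20000)
count_kept, peak_kept = peak_bytes(data, clear=False)
count_cleared, peak_cleared = peak_bytes(data, clear=True)
print(count_kept, count_cleared)
print("peak without clearing:", peak_kept)
print("peak with clearing:   ", peak_cleared)
```

The absolute numbers depend on the Python version and platform, so treat them as a relative comparison only. For lxml itself, an operating-system level measurement (for example the process's resident set size) is a better fit.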

answered Oct 13 '22 13:10

unutbu

iterparse() lets you do stuff while the tree is being built. That means that unless you remove what you no longer need, you'll still end up with the whole tree in memory at the end.

For more information, read the write-up by the author of the original ElementTree implementation (see http://effbot.org/zone/element-iterparse.htm, cited in the answer above); it also applies to lxml.
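As a minimal sketch of "removing what you don't need anymore", the pattern below grabs the root element from the first event and clears it after each finished item. It uses the standard library's xml.etree.ElementTree so it runs without lxml. The sample data, the wrapping items root element, and the function name iter_descriptions are assumptions made for illustration, and the approach assumes each item is a direct child of the root.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the question's file, wrapped in a root element
SAMPLE = b"""<items>
  <item><title>Item 1</title><desc>Description 1</desc></item>
  <item><title>Item 2</title><desc>Description 2</desc></item>
</items>"""


def iter_descriptions(source, tag="item"):
    """Yield the <desc> text of each matching element, then discard it."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.findtext("desc")
            # Detach the finished item (and anything else already attached
            # to the root) so the tree does not keep growing
            root.clear()


descriptions = list(iter_descriptions(io.BytesIO(SAMPLE)))
print(descriptions)  # ['Description 1', 'Description 2']
```

Yielding plain values rather than the elements themselves keeps the caller from holding references to parts of the tree that are about to be cleared.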

answered Oct 13 '22 13:10

Steven
