Loading huge XML files and dealing with MemoryError

I have a very large XML file (20GB to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:

Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "file.py", line 5, in <module>
    code = xml.read()
MemoryError

This is the current code I have, to read the XML file:

from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)

How would I go about eliminating this error so that I can continue working on the script? I could try splitting the file into separate files, but as I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do this.

(The XML data is a database dump from a wiki I volunteer on. I'm using it to import data from different time periods, drawing directly on the information from many pages.)

asked Feb 17 '13 17:02

Hairr

1 Answer

Do not use BeautifulSoup to try and parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, then delete the elements again:

from xml.etree import ElementTree as ET

# filename is the path to your XML file
parser = ET.iterparse(filename)

for event, element in parser:
    # by default only "end" events are reported, so element is a whole element
    if element.tag == 'yourelement':
        # do something with this element
        # then clean up
        element.clear()

By using an event-driven approach, you never need to hold the whole XML document in memory; you only extract what you need and discard the rest.
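As a rough sketch of that idea (not necessarily how you would structure your own script), the generator below streams a file with iterparse(), yields one value per matching element, and then clears what it has already processed. The function name iter_page_titles, the tag names 'page' and 'title', and the tiny in-memory sample are assumptions made purely for illustration.

```python
import io
from xml.etree import ElementTree as ET


def iter_page_titles(source):
    """Yield the <title> text of each <page>, one page at a time.

    `source` may be a filename or a binary file object.
    """
    # Request "start" events as well, so the first event gives us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, element in context:
        if event == "end" and element.tag == "page":
            yield element.findtext("title")
            # Clearing the root drops the finished <page> (and anything
            # else already parsed), so memory use stays roughly constant.
            root.clear()


# Hypothetical sample data standing in for the real dump.
sample = io.BytesIO(
    b"<pages>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</pages>"
)
titles = list(iter_page_titles(sample))
```

Clearing the root as well as the element is a design choice: element.clear() alone empties each element but leaves the emptied shells attached to the root, which can still add up over a 20GB file. Also note that tags in a real dump may be namespace-qualified (of the form '{namespace-uri}page'), so it is worth printing a few element.tag values from your own file before settling on the comparison.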

See the iterparse() tutorial and documentation.

Alternatively, you can also use the lxml library; it offers the same API in a faster and more feature-rich package.
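Because lxml.etree also provides iterparse(), switching can be as small as changing the import. The sketch below assumes that and falls back to the standard library when lxml is not installed; count_elements and the sample data are hypothetical names used only for illustration.

```python
import io

try:
    # lxml exposes an ElementTree-compatible API, including iterparse().
    from lxml import etree as ET
except ImportError:
    # Fall back to the standard library so the sketch still runs.
    from xml.etree import ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` without keeping them in memory."""
    count = 0
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            count += 1
            # Release the element's contents once it has been counted.
            element.clear()
    return count


# Hypothetical sample data standing in for the real dump.
sample = io.BytesIO(b"<pages><page/><page/><page/></pages>")
page_count = count_elements(sample, "page")
```

Only calls shared by both libraries are used here, so the loop body does not need to change when the import does.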

answered Sep 22 '22 06:09

Martijn Pieters

Tags: python, xml, beautifulsoup, mediawiki
