
Loading huge XML files and dealing with MemoryError

I have a very large XML file (20GB to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:

Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "file.py", line 5, in <module>
    code = xml.read()
MemoryError

This is the current code I have, to read the XML file:

from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)

How would I go about eliminating this error so that I can continue working on the script? I could try splitting the file into separate files, but as I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do this.

(The XML data is a database dump from a wiki I volunteer on. I'm using it to import data from different time periods, drawing directly on the information from many pages.)

Hairr asked Feb 17 '13 17:02



1 Answer

Do not use BeautifulSoup to try to parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, then delete the elements again:

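from xml.etree import ElementTree as ET

# path to the large XML file (name taken from the question)
filename = 'pages_full.xml'

# iterparse() reads the file incrementally instead of loading it all at once
parser = ET.iterparse(filename)

for event, element in parser:
    # only "end" events are reported by default, so element is a whole element
    if element.tag == 'yourelement':
        # do something with this element
        # then clean up
        element.clear()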

By using an event-driven approach, you never need to hold the whole XML document in memory; you only extract what you need and discard the rest.
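One detail worth spelling out: element.clear() empties an element, but the root element still keeps a reference to each emptied child, and on a 20GB file those can add up. A common variation, sketched below, grabs the root from the first "start" event and clears it after each processed element. The helper name iter_elements and the <wiki>/<page>/<title> layout are made up for illustration; the real dump will use its own tags, and if it declares an XML namespace, element.tag will include it in the form {uri}name.

```python
import io
from xml.etree import ElementTree as ET


def iter_elements(source, tag):
    """Yield each completed `tag` element from `source`, then free it.

    `iter_elements` is a hypothetical helper, not part of ElementTree.
    """
    # Request "start" events as well, so the first event hands us the root.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)
    for event, element in context:
        if event == "end" and element.tag == tag:
            yield element
            # Drop everything collected under the root so far, so the
            # partially built tree does not keep growing.
            root.clear()


# Tiny in-memory stand-in for the real dump (assumed layout).
sample = io.BytesIO(
    b"<wiki>"
    b"<page><title>Alpha</title></page>"
    b"<page><title>Beta</title></page>"
    b"</wiki>"
)

titles = [page.findtext("title") for page in iter_elements(sample, "page")]
print(titles)  # prints ['Alpha', 'Beta']
```

For the real file, pass the path (for example 'pages_full.xml') instead of the BytesIO object, and do all work on each element inside the loop, since it is discarded as soon as the generator resumes.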

See the iterparse() tutorial and documentation.

Alternatively, you can use the lxml library; it offers the same API in a faster and more feature-rich package.
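Because the two libraries share the iterparse() interface, switching can be as small as changing the import. The sketch below tries lxml first and falls back to the standard library if lxml is not installed; the sample document and the page tag are assumptions for illustration only.

```python
import io

try:
    # lxml is a third-party package; this import fails if it is not installed.
    from lxml import etree as ET
except ImportError:
    # The standard library exposes the same iterparse() call.
    from xml.etree import ElementTree as ET

# Small in-memory sample standing in for the real file (assumed layout).
sample = io.BytesIO(b"<wiki><page/><page/><note/></wiki>")

page_count = 0
for event, element in ET.iterparse(sample, events=("end",)):
    if element.tag == "page":
        page_count += 1
    # Free each element once its "end" event has been handled.
    element.clear()

print(page_count)  # prints 2
```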

Martijn Pieters answered Sep 22 '22 06:09