I have a very large XML file (20GB to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:
Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "file.py", line 5, in <module>
code = xml.read()
MemoryError
This is the current code I have, to read the XML file:
from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)
Now, how would I go about eliminating this error so I can continue working on the script? I would try splitting the file into separate files, but since I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do that.
(The XML data is a database dump from a wiki I volunteer on; I'm using it to import data from different time periods, using the direct information from many pages.)
Do not use BeautifulSoup to try to parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, and then delete the elements again:
from xml.etree import ElementTree as ET

parser = ET.iterparse(filename)
for event, element in parser:
    # by default you are only notified of 'end' events,
    # so element is a fully parsed element here
    if element.tag == 'yourelement':
        # do something with this element
        ...
    # then clean up, so the element no longer holds its children in memory
    element.clear()
By using an event-driven approach, you never need to hold the whole XML document in memory; you only extract what you need and discard the rest.
See the iterparse() tutorial and documentation.
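To illustrate, here is a minimal, self-contained sketch of the streaming pattern, using an in-memory sample in place of your 20GB dump. The element and tag names ('page', 'title') are placeholders; a real MediaWiki export uses namespaced tags, so you would adjust the tag you match on. It also shows one extra cleanup step: clearing the root as well, since the root otherwise keeps references to already-processed elements and memory still grows.

```python
import io
from xml.etree import ElementTree as ET

def iter_elements(source, tag):
    """Yield each element named `tag` as it finishes parsing, then free it."""
    parser = ET.iterparse(source, events=('start', 'end'))
    _, root = next(parser)          # the first 'start' event gives us the root
    for event, element in parser:
        if event == 'end' and element.tag == tag:
            yield element
            element.clear()         # drop this element's children and text
            root.clear()            # drop the root's reference to it as well

# Tiny stand-in for the real dump file; pass a filename or file object here.
sample = io.BytesIO(b"<wiki><page><title>A</title></page>"
                    b"<page><title>B</title></page></wiki>")

titles = [page.findtext('title') for page in iter_elements(sample, 'page')]
print(titles)  # -> ['A', 'B']
```

Because iter_elements is a generator, only one page is alive at a time, no matter how large the file is.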
Alternatively, you can also use the lxml library; it offers the same API in a faster and more feature-rich package.
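For example, lxml's iterparse() accepts a tag= filter, so you do not need to check element.tag yourself; as above, the element names are illustrative and lxml must be installed separately (pip install lxml). lxml also lets you delete already-processed preceding siblings, the usual way to keep memory flat there:

```python
import io
from lxml import etree  # third-party: pip install lxml

sample = io.BytesIO(b"<wiki><page><title>A</title></page>"
                    b"<page><title>B</title></page></wiki>")

titles = []
# tag='page' means we are only notified when a matching element is complete
for _, page in etree.iterparse(sample, tag='page'):
    titles.append(page.findtext('title'))
    page.clear()
    # free preceding siblings that the root still references
    while page.getprevious() is not None:
        del page.getparent()[0]

print(titles)  # -> ['A', 'B']
```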