Python

Question

Essentially, I have a 6.4GB XML file that I'd like to convert to JSON then save it to disk. I'm currently running OSX 10.8.4 with an i7 2700k and 16GBs of ram, and running Python 64bit (double checked). I'm getting an error that I don't have enough memory to allocate. How do I go about fixing this?

print 'Opening'
f = open('large.xml', 'r')
data = f.read()
f.close()

print 'Converting'
newJSON = xmltodict.parse(data)

print 'Json Dumping'
newJSON = json.dumps(newJSON)

print 'Saving'
f = open('newjson.json', 'w')
f.write(newJSON)
f.close()

The Error:

Python(2461) malloc: *** mmap(size=140402048315392) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "/Users/user/Git/Resources/largexml2json.py", line 10, in <module>
    data = f.read()
MemoryError

Leonardo.Z · Accepted Answer

Many Python XML libraries support parsing XML sub elements incrementally, e.g. xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. These functions are usually called "XML Stream Parser".

The xmltodict library you used also has a streaming mode. I think it may solve your problem

https://github.com/martinblech/xmltodict#streaming-mode

spiralx · Answer

Instead of trying to read the file in one go and then process it, you want to read it in chunks and process each chunk as it's loaded. This is a fairly common situation when processing large XML files and is covered by the Simple API for XML (SAX) standard, which specifies a callback API for parsing XML streams - it's part of the Python standard library under xml.sax.parse and xml.etree.ETree as mentioned above.

Here's a quick XML to JSON converter:

from collections import defaultdict
import json
import xml.etree.ElementTree as ET

def parse_xml(file_name):
    events = ("start", "end")
    context = ET.iterparse(file_name, events=events)

    return pt(context)

def pt(context, cur_elem=None):
    items = defaultdict(list)

    if cur_elem:
        items.update(cur_elem.attrib)

    text = ""

    for action, elem in context:
        # print("{0:>6} : {1:20} {2:20} '{3}'".format(action, elem.tag, elem.attrib, str(elem.text).strip()))

        if action == "start":
            items[elem.tag].append(pt(context, elem))
        elif action == "end":
            text = elem.text.strip() if elem.text else ""
            elem.clear()
            break

    if len(items) == 0:
        return text

    return { k: v[0] if len(v) == 1 else v for k, v in items.items() }

if __name__ == "__main__":
    json_data = parse_xml("large.xml")
    print(json.dumps(json_data, indent=2))

If you're looking at a lot of XML processing check out the lxml library, it's got a ton of useful stuff over and above the standard modules, while also being much easier to use.

http://lxml.de/tutorial.html

Python - Convert Very Large (6.4GB) XML files to JSON

Tags:

json

xml

Dustin

2 Answers

Leonardo.Z

spiralx

Recent Activity

Donate For Us