
Memory error in Python while parsing a 300 MB file

I'm parsing an XML file (291 MB) in Python 3.5 with:

import xmltodict, json

with open('Wikipedia-20160404094133.xml', encoding='utf-8') as xml_file:
    dic_xml = xmltodict.parse(xml_file.read(), encoding='utf-8', xml_attribs=True)

but I get the error:

dic_xml = xmltodict.parse(xml_file.read(), encoding='utf-8', xml_attribs=True)
MemoryError

What can I do to solve this?

asked Nov 08 '22 17:11 by Knokkelgeddon


1 Answer

Check out this note from the xmltodict README:

"xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia"

Essentially, you need to process the file in chunks instead of loading it all at once with xml_file.read() and building a single dict from it. xmltodict's "streaming mode" seems to be built for this.
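A minimal sketch of what streaming mode could look like for this file, assuming the dump follows the usual MediaWiki export layout (a root element whose direct children include <page> elements, each with a <title>). The function name collect_titles and the decision to collect only titles are illustrative, not part of xmltodict. The file is opened in binary mode and passed to parse() as a file object, so the whole document is never held in memory as one string.

```python
import xmltodict


def collect_titles(xml_path, item_depth=2):
    """Stream an XML dump and return the <title> of every <page> element.

    collect_titles is a hypothetical helper; item_depth=2 assumes the
    pages sit directly under the root element.
    """
    titles = []

    def handle_item(path, item):
        # path is a list of (tag, attributes) pairs from the root to this item.
        tag = path[-1][0]
        if tag == 'page' and isinstance(item, dict):
            titles.append(item.get('title'))
        # Returning a falsy value would make xmltodict abort parsing.
        return True

    # Binary mode: the underlying Expat parser reads bytes from the file.
    with open(xml_path, 'rb') as xml_file:
        xmltodict.parse(xml_file, item_depth=item_depth,
                        item_callback=handle_item)
    return titles
```

Called as collect_titles('Wikipedia-20160404094133.xml'), this hands each page to the callback one at a time. For a very large dump you would probably write each item to disk or a database inside the callback rather than appending to a list.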

answered Nov 14 '22 21:11 by jDo