 

Efficient reading of 800 GB XML file in Python 2.7


I am reading an 800 GB XML file in Python 2.7 and parsing it with an etree iterative parser (iterparse).

Currently, I am just using open('foo.xml') with no buffering argument. I am a little confused about whether this is the approach I should take, or whether I should pass a buffering argument or use something from io like io.BufferedReader, io.open, or io.TextIOBase.
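For reference, this is roughly the pattern in use (a minimal sketch; 'record' and process() are placeholders, not real names from my code):

from xml.etree import cElementTree as ET  # C-accelerated etree in Python 2.7

with open('foo.xml') as f:  # no buffering argument, i.e. the platform default
    for event, elem in ET.iterparse(f):
        if elem.tag == 'record':  # placeholder element name
            process(elem)         # placeholder per-element handler
            elem.clear()          # keep memory bounded on a huge file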

A pointer in the right direction would be much appreciated.

Asked Feb 13 '13 by Mike S


1 Answer

The standard open() function already returns a buffered file by default (if available on your platform). For regular files that buffering is usually full buffering (as opposed to line buffering for terminals).

Usually here means that Python leaves this to the C stdlib implementation; it uses a fopen() call (_wfopen() on Windows to support UTF-16 filenames), which means the default buffering for a file is chosen; on Linux I believe that would be 8 KB. For a pure read operation like XML parsing, this type of buffering is exactly what you want.
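The C stdio default isn't exposed to Python directly, but as a rough point of comparison you can inspect the io module's own default (the exact value is platform-dependent):

import io
print io.DEFAULT_BUFFER_SIZE  # typically 8192 bytes on CPython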

The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 KB).
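If you want to see those reads for yourself, a throwaway sketch (foo.xml is a placeholder) that wraps the file object and logs every read the parser makes:

from xml.etree import cElementTree as ET

class LoggingFile(object):
    """Minimal file wrapper; iterparse only needs read()."""
    def __init__(self, fileobj):
        self._file = fileobj
    def read(self, size=-1):
        print 'read(%d)' % size  # you should see read(16384) repeatedly
        return self._file.read(size)

for event, elem in ET.iterparse(LoggingFile(open('foo.xml'))):
    break  # one event is enough to see the read pattern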

If you want to control the buffer size, use the buffering keyword argument:

open('foo.xml', buffering=(2<<16) + 8)  # buffer enough for 8 full parser reads

which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article, increasing the read buffer should help, and using a size that is at least 4 times the expected read block size plus 8 bytes will improve read performance. In the above example I've set it to 8 times the ElementTree read size plus 8 bytes.
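If you want hard numbers for your own disk and data, a rough timing sketch (sample.xml is a placeholder; run this against a truncated sample of a few GB, not the full 800 GB):

import time
from xml.etree import cElementTree as ET

def time_parse(buffering):
    start = time.time()
    with open('sample.xml', 'rb', buffering) as f:
        for event, elem in ET.iterparse(f):
            elem.clear()  # discard parsed elements as we go
    return time.time() - start

print 'default buffer: %.2fs' % time_parse(-1)             # platform default
print 'large buffer:   %.2fs' % time_parse((2 << 16) + 8)  # 8 parser reads + 8 bytes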

The io.open() function represents the new Python 3 I/O hierarchy, where I/O has been split into distinct class types to give you more flexibility. The price is more indirection: more layers for the data to travel through, with the Python C code doing more of the work itself instead of leaving it to the OS.

You could try and see whether io.open('foo.xml', 'rb', buffering=2<<16) performs any better. Opening in rb mode will give you an io.BufferedReader instance.

You do not want to use io.TextIOWrapper; the underlying expat parser wants raw bytes, as it decodes your XML file's encoding itself. The wrapper would only add extra overhead; you get this type if you open in r (text mode) instead.
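You can see the difference the mode makes directly (a quick interactive check):

import io
print type(io.open('foo.xml', 'rb'))  # <type '_io.BufferedReader'>, raw bytes
print type(io.open('foo.xml', 'r'))   # <type '_io.TextIOWrapper'>, decoded text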

Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.

Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800 GB file.

Answered Sep 21 '22 by Martijn Pieters