 

Fastest way to store large files in Python

I recently asked a question about how to save large Python objects to file. I had previously run into problems converting massive Python dictionaries to strings and writing them to file via write(). Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience with files this large. I wanted to know if it would be faster, or even possible, to compress this pickle file before writing it to disk.

puk asked Oct 03 '11

2 Answers

Pure Python code is extremely slow at data serialization; if you tried to implement an equivalent of pickle in pure Python, you would see just how slow. Fortunately, the built-in modules that do this are implemented in C and are quite good.

Apart from cPickle, there is the marshal module, which is a lot faster. Note that it needs a real file handle (not a file-like object). You can import marshal as pickle and compare the difference. I don't think you can write a custom serializer that is much faster than this...
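As a rough illustration, here is a sketch comparing the two (written for Python 3, where cPickle has been folded into pickle; the sample data and filenames are made up, and timings will vary by machine):

```python
import marshal
import pickle
import time

# a made-up dataset of basic types, which both modules can handle
data = {i: [i, str(i), i * 2.0] for i in range(100000)}

# pickle: general-purpose, handles most Python objects
t0 = time.perf_counter()
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
pickle_time = time.perf_counter() - t0

# marshal: Python's internal format; only supports core types
# (dict, list, str, int, float, ...) and needs a real file object
t0 = time.perf_counter()
with open('data.marshal', 'wb') as f:
    marshal.dump(data, f)
marshal_time = time.perf_counter() - t0

print(pickle_time, marshal_time)
```

Keep in mind that marshal's format is not guaranteed to be stable across Python versions, so it is only safe for short-lived caches, not long-term storage.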

Here's a serious (and not so old) benchmark of Python serializers.

cJ Zougloub answered Sep 20 '22


You can compress the data with bzip2:

from __future__ import with_statement  # only needed on Python 2.5
import bz2
import json
import contextlib

hugeData = {'key': {'x': 1, 'y': 2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
    json.dump(hugeData, f)

Load it like this:

from __future__ import with_statement  # only needed on Python 2.5
import bz2
import json
import contextlib

with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
    hugeData = json.load(f)

You can also compress the data using zlib or gzip, which have pretty much the same interface. However, both zlib's and gzip's compression ratios will generally be lower than the one achieved with bzip2 (or lzma).
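For example, the gzip variant looks almost identical. This is a Python 3 sketch (the filename is made up): gzip.open's text mode lets json write to it directly, while on Python 2 you would wrap gzip.GzipFile with contextlib.closing as above:

```python
import gzip
import json

hugeData = {'key': {'x': 1, 'y': 2}}

# 'wt'/'rt' open the compressed stream in text mode, so json
# can read and write strings straight through it
with gzip.open('data.json.gz', 'wt', encoding='utf-8') as f:
    json.dump(hugeData, f)

with gzip.open('data.json.gz', 'rt', encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == hugeData)  # → True
```

One caveat with any JSON-based approach: JSON only round-trips basic types (dicts with string keys, lists, strings, numbers), so it won't replace pickle for arbitrary Python objects.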

phihag answered Sep 18 '22