 

Python gzip: OverflowError size does not fit in an int

I am trying to serialize a large Python object, composed of a tuple of numpy arrays, using pickle/cPickle and gzip. The procedure works well up to a certain data size, after which I receive the following error:

--> 121     cPickle.dump(dataset_pickle, f)

    ***/gzip.pyc in write(self, data)
    238             print(type(self.crc))
    239             print(self.crc)
--> 240             self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
    241             self.fileobj.write( self.compress.compress(data) )

OverflowError: size does not fit in an int

The numpy array is around 1.5 GB, and the string passed to zlib.crc32 exceeds 2 GB. I am working on a 64-bit machine and my Python is also 64-bit:

>>> import sys
>>> sys.maxsize
9223372036854775807

Is it a bug in Python, or am I doing something wrong? Are there any good alternatives for compressing and serializing numpy arrays? I am looking at numpy.savez, PyTables and HDF5 right now, but it would be good to know why I am having these problems, since I have enough memory.


Update: I remember reading somewhere that this could be caused by using an old version of Numpy (and I was using one), but I've fully switched to numpy.save/savez instead, which is actually faster than cPickle (at least in my case).
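For reference, a minimal sketch of the numpy.savez route mentioned above, using the compressed variant; the array and file names here are made up for illustration:

```python
import numpy as np

# Hypothetical arrays standing in for the large dataset tuple
train_x = np.arange(1000, dtype=np.float64).reshape(100, 10)
train_y = np.arange(100, dtype=np.int64)

# Save both arrays into a single compressed .npz archive
np.savez_compressed("dataset.npz", train_x=train_x, train_y=train_y)

# Load them back; arrays are retrieved by the keyword names used above
with np.load("dataset.npz") as data:
    restored_x = data["train_x"]
    restored_y = data["train_y"]
```

Unlike pickling the whole tuple through gzip, this writes each array separately, so it sidesteps building one multi-gigabyte string in memory.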

Asked by gsmafra on May 21 '15

1 Answer

This seems to be a bug in Python 2.7:

https://bugs.python.org/issue23306

From inspecting the bug report, it does not look like there is a pending fix. Your best bet would be to move to Python 3, which apparently does not exhibit this bug.
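If staying on Python 2 is unavoidable, one possible workaround (a sketch of my own, not taken from the bug report; the helper names are made up) is to serialize the object to a byte string first and write it to the gzip file in chunks, so that no single write, and hence no single zlib.crc32 call, sees more than 2 GB at once:

```python
import gzip
import pickle  # use cPickle on Python 2

# 1 GiB per write keeps each crc32 update well under the 2 GB limit
CHUNK_SIZE = 1 << 30

def gzip_pickle_in_chunks(obj, path):
    # Serialize first, then feed the byte string to gzip piecewise
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with gzip.open(path, "wb") as f:
        for start in range(0, len(data), CHUNK_SIZE):
            f.write(data[start:start + CHUNK_SIZE])

def gzip_unpickle(path):
    # Reading back is unaffected by the bug; load normally
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

This trades extra memory (the full pickled string is held at once) for avoiding the overflowing single write, so it only helps when the machine has enough RAM, as in the question.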

Answered by Perennial on Sep 27 '22