I am trying to serialize a large Python object, composed of a tuple of numpy arrays, using pickle/cPickle and gzip. The procedure works well up to a certain data size, after which I get the following error:
--> 121 cPickle.dump(dataset_pickle, f)
***/gzip.pyc in write(self, data)
238 print(type(self.crc))
239 print(self.crc)
--> 240 self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
241 self.fileobj.write( self.compress.compress(data) )
OverflowError: size does not fit in an int
The numpy array is around 1.5 GB, and the string sent to zlib.crc32 exceeds 2 GB. I am working on a 64-bit machine and my Python is also 64-bit:
>>> import sys
>>> sys.maxsize
9223372036854775807
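As far as I can tell, the same error can be triggered by calling zlib.crc32 directly on a large enough string (a minimal sketch, assuming Python 2.7 on a 64-bit machine with more than 2 GB of free RAM):

import zlib

data = '\x00' * (2**31)   # a string just past the signed 32-bit size limit
zlib.crc32(data)          # on Python 2.7 this raises OverflowError: size does not fit in an int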
Is this a bug in Python, or am I doing something wrong? Are there any good alternatives for compressing and serializing numpy arrays? I am looking at numpy.savez, PyTables and HDF5 right now, but it would be good to know why I am having this problem, since I have enough memory.
Update: I remember reading somewhere that this could be caused by an old version of Numpy (which I was using), but I've fully switched to numpy.save/savez instead, which is actually faster than cPickle (at least in my case).
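For reference, a minimal sketch of the savez route (the filename and array names below are just placeholders; np.savez_compressed is the variant that also applies DEFLATE compression, plain np.savez stores the arrays uncompressed):

import numpy as np

train_x = np.random.rand(1000, 100)   # placeholder arrays standing in for the real dataset
train_y = np.random.rand(1000)

# Each keyword becomes a named array inside the compressed .npz archive.
np.savez_compressed('dataset.npz', train_x=train_x, train_y=train_y)

# Loading returns a dict-like NpzFile keyed by the names used above.
data = np.load('dataset.npz')
train_x, train_y = data['train_x'], data['train_y']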
This seems to be a bug in Python 2.7:
https://bugs.python.org/issue23306
From inspecting the bug report, it does not look like a fix is pending. Your best bet is to move to Python 3, which apparently does not exhibit this bug.
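If moving to Python 3 is not an option, one possible workaround (just a sketch of mine, not something taken from the bug report) is to pickle to a string first and feed it to the GzipFile in chunks smaller than 2 GB, so that each internal zlib.crc32 call stays under the 32-bit limit:

import cPickle
import gzip

CHUNK = 1 << 30   # 1 GiB per write keeps every zlib.crc32 call below the 2 GB limit

def gzip_pickle(obj, path):
    # Serialize to one string, then write it out in pieces small enough for gzip/zlib.
    data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
    with gzip.open(path, 'wb') as f:
        for start in xrange(0, len(data), CHUNK):
            f.write(data[start:start + CHUNK])

This holds the whole pickled string in memory at once, which trades RAM for avoiding the oversized single write; since the data is only around 1.5 GB and you say memory is not the constraint, that should be acceptable.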