I have a Python script that compresses a big string:
import zlib
def processFiles():
  ...
  s = """Large string more than 2Gb"""
  data = zlib.compress(s)
  ...
When I run this script, I get an error:
ERROR: Traceback (most recent call last):
  File "./../commands/sce.py", line 438, in processFiles
    data = zlib.compress(s)
OverflowError: size does not fit in an int
Some information:
zlib.__version__ = '1.0'
zlib.ZLIB_VERSION = '1.2.7'
# python -V
Python 2.7.3
# uname -a
Linux app2 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
# free
             total       used       free     shared    buffers     cached
Mem:      65997404    8096588   57900816          0     184260    7212252
-/+ buffers/cache:     700076   65297328
Swap:     35562236          0   35562236
# ldconfig -p | grep python
libpython2.7.so.1.0 (libc6,x86-64) => /usr/lib/libpython2.7.so.1.0
libpython2.7.so (libc6,x86-64) => /usr/lib/libpython2.7.so
How can I compress data bigger than 2 GB in Python?
To create your own compressed ZIP file instead, open a ZipFile object in write mode by passing 'w' as the second argument. When you then pass a path to the write() method of that ZipFile object, Python compresses the file at that path and adds it to the archive.
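A minimal sketch of that zipfile approach; 'archive.zip' and 'data.txt' are placeholder paths:
import zipfile

# ZIP_DEFLATED enables zlib compression (the default, ZIP_STORED, stores
# files uncompressed). allowZip64 lets the archive grow past the 32-bit
# ZIP limits, which matters for multi-GB inputs.
with zipfile.ZipFile('archive.zip', 'w', zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
    zf.write('data.txt')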
On Python 3, zlib.compress(text) should be compressed = zlib.compress(text.encode()), because compress() expects bytes rather than str. Compression also tends to pay off most on longer strings.
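A minimal illustration of the round trip (the text is a placeholder; on Python 2 a plain byte str can be passed to zlib.compress() directly):
import zlib

text = u'a long string...'  # placeholder
# zlib works on bytes, so encode the unicode string before compressing
# and decode again after decompressing.
compressed = zlib.compress(text.encode('utf-8'))
restored = zlib.decompress(compressed).decode('utf-8')
assert restored == text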
Another option is standard Python pickle, thinly wrapped with a standard compression library. The pickle module provides an excellent default tool for serializing arbitrary Python objects and storing them to disk, and the standard library also includes a broad set of compression packages.
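A small sketch of that pickle-plus-compression combination; the object and the file name are just placeholders:
import pickle
import zlib

# Serialize an arbitrary object, then compress the resulting bytes.
obj = {'key': list(range(1000))}
blob = zlib.compress(pickle.dumps(obj, protocol=2))
with open('obj.pkl.z', 'wb') as f:
    f.write(blob)

# Reading it back reverses both steps.
with open('obj.pkl.z', 'rb') as f:
    obj2 = pickle.loads(zlib.decompress(f.read()))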
My function to compress large data:
def compressData(self, s):
    # Feed the string to a compressobj in 1 GB blocks, so no single
    # zlib call sees a buffer whose size overflows a 32-bit int.
    # Collecting the pieces in a list and joining once avoids
    # quadratic string concatenation on multi-GB data.
    chunks = []
    begin = 0
    blockSize = 1073741824  # 1 GB
    compressor = zlib.compressobj()
    while begin < len(s):
        chunks.append(compressor.compress(s[begin:begin + blockSize]))
        begin += blockSize
    chunks.append(compressor.flush())
    return ''.join(chunks)
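Note that zlib.decompress() is subject to the same int-size limit on its input, so decompressing the result may need the same chunked treatment. A mirror-image sketch using zlib.decompressobj():
def decompressData(self, compressed):
    # Feed the compressed bytes to a decompressobj in 1 GB blocks,
    # mirroring compressData() above.
    chunks = []
    begin = 0
    blockSize = 1073741824  # 1 GB
    decompressor = zlib.decompressobj()
    while begin < len(compressed):
        chunks.append(decompressor.decompress(compressed[begin:begin + blockSize]))
        begin += blockSize
    chunks.append(decompressor.flush())
    return ''.join(chunks)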
This is not a RAM issue. The Python zlib binding stores buffer sizes in a C int, so a single zlib.compress() call cannot accept more than INT_MAX (roughly 2 GB) bytes, no matter how much memory is free.
Split your data into chunks below that limit and process each one separately, as the function above does.
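If the data can live on disk rather than in one giant in-memory string, the same chunking idea can stream a file through zlib without ever holding it all at once. A hypothetical file-to-file helper (the function name and block size are my own choices):
import zlib

def compressFile(srcPath, dstPath, blockSize=64 * 1024 * 1024):
    # Stream srcPath through zlib in 64 MB blocks and write the
    # compressed stream to dstPath, so neither the full input nor the
    # full output is ever held in memory.
    compressor = zlib.compressobj()
    with open(srcPath, 'rb') as src, open(dstPath, 'wb') as dst:
        while True:
            block = src.read(blockSize)
            if not block:
                break
            dst.write(compressor.compress(block))
        dst.write(compressor.flush())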