Trouble with compressing big data in python

Tags:

python

zlib

I have a Python script that compresses a big string:

import zlib

def processFiles():
  ...
  s = """Large string more than 2Gb"""
  data = zlib.compress(s)
  ...

When I run this script, I get the following error:

Traceback (most recent call last):
  File "./../commands/sce.py", line 438, in processFiles
    data = zlib.compress(s)
OverflowError: size does not fit in an int

Some information:

zlib.__version__ = '1.0'

zlib.ZLIB_VERSION = '1.2.7'

# python -V
Python 2.7.3

# uname -a
Linux app2 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux

# free
             total       used       free     shared    buffers     cached
Mem:      65997404    8096588   57900816          0     184260    7212252
-/+ buffers/cache:     700076   65297328
Swap:     35562236          0   35562236

# ldconfig -p | grep python
libpython2.7.so.1.0 (libc6,x86-64) => /usr/lib/libpython2.7.so.1.0
libpython2.7.so (libc6,x86-64) => /usr/lib/libpython2.7.so

How can I compress big data (more than 2 GB) in Python?

asked May 27 '14 by Dmitry Skryabin



2 Answers

My function to compress large data:

def compressData(self, s):
    # zlib.compress() hands the whole buffer to C in a single call, which
    # overflows a C int for data over 2 GB, so feed a compressobj in blocks.
    compressor = zlib.compressobj()
    blockSize = 1073741824  # 1 GB
    chunks = []
    begin = 0
    while begin < len(s):
        chunks.append(compressor.compress(s[begin:begin + blockSize]))
        begin += blockSize
    chunks.append(compressor.flush())
    return ''.join(chunks)
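
For symmetry, the result can be decompressed the same way by streaming it through zlib.decompressobj(); this is a minimal sketch added for illustration, not part of the original answer (the name decompressData is made up):

import zlib

def decompressData(compressed):
    # Mirror of compressData(): feed the compressed bytes to a
    # decompressobj in 1 GB slices so no single call overflows a C int.
    decompressor = zlib.decompressobj()
    blockSize = 1073741824  # 1 GB
    chunks = []
    for begin in xrange(0, len(compressed), blockSize):
        chunks.append(decompressor.decompress(compressed[begin:begin + blockSize]))
    chunks.append(decompressor.flush())
    return ''.join(chunks)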
answered Oct 04 '22 by Dmitry Skryabin


This is not a RAM issue. The CPython zlib binding passes the buffer length to C as an int, so a single zlib.compress() call cannot handle data larger than 2 GB.

Split your data into smaller chunks and process each one separately, as in the sketch below.
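
For example, if the big data lives in a file, it can be streamed through a compressobj block by block so no single call ever sees an oversized buffer; a rough sketch, with placeholder file names:

import zlib

CHUNK = 64 * 1024 * 1024  # read 64 MB at a time

# 'big_input.dat' and 'big_input.dat.z' are hypothetical names.
compressor = zlib.compressobj()
with open('big_input.dat', 'rb') as src, open('big_input.dat.z', 'wb') as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(compressor.compress(block))
    dst.write(compressor.flush())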

answered Oct 04 '22 by JBernardo