
Zlib compress in python

Why is the compressed string bigger than the original? Doesn't zlib need to compress?

Example:

import zlib
import sys

str1 = "abcdefghijklmnopqrstuvwxyz"
print "size1: ", sys.getsizeof(str1)

print "size2: ", sys.getsizeof(zlib.compress(str1))

The output:

size1:  47
size2:  55
CSch of x asked Mar 22 '18 15:03


2 Answers

You're going to have a hard time compressing a string like that. It's rather short and contains 26 unique characters. Compressors work by giving shorter codes to frequent symbols and by replacing repeated sequences with references to earlier occurrences, so with all unique characters and no repetition there is nothing for them to exploit and you'll get poor compression.

You'll also get poor compression if the data is random.

Here's an example with a string of the same length that does compress.

>>> str2 = 'a'*26
>>> str2
'aaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> sys.getsizeof(str2)
63
>>> sys.getsizeof(zlib.compress(str2))
48
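
Note that sys.getsizeof reports the size of the whole Python object, including a fixed header that varies between interpreter builds, so len() is the clearer measure of the payload. Below is a minimal sketch, assuming Python 3 (where zlib.compress takes bytes rather than str). The `samples` and `sizes` names are illustrative only, and the "random" input is generated from a fixed seed so the run is repeatable.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the "random" sample is reproducible

# Three 26-byte inputs: all distinct, all identical, and pseudo-random bytes.
samples = {
    "alphabet": b"abcdefghijklmnopqrstuvwxyz",
    "repeated": b"a" * 26,
    "random": bytes(rng.randrange(256) for _ in range(26)),
}

# len() counts payload bytes only; sys.getsizeof would also count the
# Python object header.
sizes = {
    name: (len(data), len(zlib.compress(data)))
    for name, data in samples.items()
}

for name, (raw, packed) in sizes.items():
    print(f"{name:9s} raw={raw:3d} compressed={packed:3d}")
```

Only the repeated input should come out smaller than 26 bytes. The other two grow because the zlib header and checksum cost more than the compressor can save.
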
Grant Williams answered Oct 10 '22 11:10


Grant's answer is fine, but something here needs to be emphasized.

Doesn't zlib need to compress?

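No! It does not, and cannot, always compress. Any operation that losslessly compresses and decompresses an input must expand some (actually most) inputs, while compressing only some. This is a simple consequence of counting: there are 2^n distinct inputs of n bits but only 2^n - 1 distinct strings shorter than n bits, so no reversible mapping can shorten all of them.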
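
The counting argument can be checked with a few lines of arithmetic. This is only an illustrative sketch, and the function names are made up for the example. It counts the bit strings of a given length, counts the strictly shorter ones, and bounds the fraction of inputs that any lossless scheme could shrink by a given number of bits.

```python
def count_strings(bits):
    """Number of distinct bit strings of exactly `bits` bits."""
    return 2 ** bits


def count_shorter(bits):
    """Number of distinct bit strings strictly shorter than `bits` bits
    (lengths 0 through bits - 1, including the empty string)."""
    return sum(2 ** k for k in range(bits))


def max_fraction_saving(bits, saved):
    """Upper bound on the fraction of `bits`-bit inputs that a lossless
    scheme can shrink by at least `saved` bits. Each shrunken input needs
    its own output of length <= bits - saved."""
    targets = count_shorter(bits - saved + 1)
    return targets / count_strings(bits)


n = 16
print(count_strings(n))           # inputs of 16 bits
print(count_shorter(n))           # available shorter outputs: one fewer
print(max_fraction_saving(n, 8))  # share that could lose a byte or more
```

With 16-bit inputs, fewer than 1% can be shortened by a full byte, whatever the algorithm.
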

The only thing that is guaranteed by a lossless compressor is that what you get out from decompression is what you put in to compression.
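
A minimal round-trip check, assuming Python 3 and the standard zlib module, shows that guarantee on the string from the question. The output is larger than the input, yet decompression returns exactly the original bytes.

```python
import zlib

original = b"abcdefghijklmnopqrstuvwxyz"
packed = zlib.compress(original)
restored = zlib.decompress(packed)

# The compressed form is longer here, but the round trip is exact.
print(len(original), len(packed), restored == original)
```
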

Any useful compression scheme is rigged to take advantage of the specific redundancies expected in the particular kind of data being compressed. Language-like data (English text, C code, data files, even machine code) is a sequence of symbols with a skewed frequency distribution and often-repeated strings, and it is compressed using models that expect and look for those redundancies. Such schemes depend on gathering statistics from at least the first tens of kilobytes of the data before compression becomes really effective.
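
The effect of input size can be seen with a rough experiment. The sketch below assumes Python 3. It uses a hypothetical stand-in for English text: words drawn at random from a small made-up vocabulary, not real prose, so the exact ratios are only indicative. It compresses prefixes of increasing length and prints the ratio of compressed size to original size.

```python
import random
import zlib

# Hypothetical stand-in for natural-language text: a small vocabulary
# gives skewed symbol frequencies and many repeated strings.
VOCAB = ["the", "compressor", "finds", "repeated", "strings", "and",
         "frequent", "symbols", "in", "data", "of", "this", "kind"]


def make_text(n_bytes, seed=0):
    """Return exactly n_bytes of space-separated pseudo-random words."""
    rng = random.Random(seed)
    words = []
    total = 0
    while total <= n_bytes:
        word = rng.choice(VOCAB)
        words.append(word)
        total += len(word) + 1  # word plus one separating space
    return " ".join(words).encode("ascii")[:n_bytes]


def ratio(data):
    """Compressed size divided by original size (below 1.0 means smaller)."""
    return len(zlib.compress(data)) / len(data)


text = make_text(50_000)
ratios = {n: ratio(text[:n]) for n in (25, 250, 2_500, 50_000)}

for n, r in ratios.items():
    print(f"{n:6d} bytes -> ratio {r:.2f}")
```

The shortest prefix gains little or nothing, while the longest one compresses far better because the compressor has seen enough of the data to exploit its patterns.
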

Your example is far too short to have the statistics needed, and has no repetition of any kind, and so will be expanded by any general compressor.

Mark Adler answered Oct 10 '22 11:10