
Practical Compression of Random Data

So yesterday I asked a question about compressing a sequence of integers (link), and most comments made a similar point: if the order is random (or worse, the data is completely random), then one has to settle for log2(k) bits for a value k. I've also read similar replies to other questions on this site. Now, I hope this isn't a silly question: if I take that sequence, serialize it to a file, and then run gzip on the file, I do achieve compression (and depending on the time I allow gzip to run, I might get high compression). Could somebody explain this?

Thanks in advance.

asked Sep 21 '12 by jplot



3 Answers

if I take that sequence, serialize it to a file, and then run gzip on the file, I do achieve compression

What is "it"? If you take random bytes (each uniformly distributed in 0..255) and feed them to gzip or any compressor, you may on very rare occasions get a small amount of compression, but most of the time you will get a small amount of expansion.

answered by Mark Adler


My guess is that you're achieving compression on your random file because you're not using an optimal serialization technique, but without more details it's impossible to answer your question. Is the compressed file with n numbers in the range [0, k) less than n*log2(k) bits (that is, n*log256(k) bytes)? If so, does gzip manage that for all the random files you generate, or just occasionally?
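To see how much slack a naive serialization leaves, here's a back-of-the-envelope sketch; n and k are made-up parameters, not taken from your question:

    // Compare the entropy floor for n random values in [0, k) with the size
    // of a naive decimal-text serialization (digits plus a newline each).
    #include <cmath>
    #include <cstdio>

    int main() {
        const double n = 1e6, k = 1000.0;

        double floor_bytes = n * std::log2(k) / 8.0;          // ~1.25 MB minimum

        // Values in [0, 1000) average 2.89 decimal digits:
        // 10 one-digit, 90 two-digit, and 900 three-digit values.
        double avg_digits = (10 * 1 + 90 * 2 + 900 * 3) / 1000.0;
        double text_bytes = n * (avg_digits + 1);             // ~3.89 MB as text

        std::printf("entropy floor: %.0f bytes\n", floor_bytes);
        std::printf("decimal text:  %.0f bytes\n", text_bytes);
        // gzip can shrink the text file considerably, but never below the
        // floor: the "compression" only reclaims serialization slack.
    }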

Let me note one thing: suppose you say to me, "I've generated a file of random octets by using a uniform_int_distribution(0, 255) with the mt19937 prng [1]. What's the optimal compression of my file?" Now, my answer could reasonably be: "probably about 80 bits". All I need to reproduce your file is

  • the value you used to seed the prng, quite possibly a 32-bit integer [2]; and

  • the length of the file, which probably fits in 48 bits.

And if I can reproduce the file given 80 bits of data, that's the optimal compression. Unfortunately, that's not a general purpose compression strategy. It's highly unlikely that gzip will be able to figure out that you used a particular prng to generate the file, much less that it will be able to reverse-engineer the seed (although these things are, at least in theory, achievable; the Mersenne twister is not a cryptographically secure prng.)
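To sketch that idea: the entire "decompressor" for such a file is just the generator replayed with the same seed (decompress here is a hypothetical name, not part of any library):

    // Regenerate a "random" file from its ~80-bit description: the 32-bit
    // seed and the file length. The output is bit-identical to the original.
    #include <cstdint>
    #include <random>
    #include <vector>

    std::vector<unsigned char> decompress(std::uint32_t seed, std::uint64_t length) {
        std::mt19937 prng(seed);
        std::uniform_int_distribution<int> byte(0, 255);
        std::vector<unsigned char> file(length);
        for (auto& b : file) b = static_cast<unsigned char>(byte(prng));
        return file;
    }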

For another example, it's generally recommended that text be compressed before being encrypted; the result will be quite a bit shorter than compressing after encryption. But the fact is that encryption adds very little entropy; at most, it adds the number of bits in the encryption key. Nonetheless, the resulting output is difficult to distinguish from random data, and gzip will struggle to compress it (although it often manages to squeeze a few bits out).
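As a toy illustration of why, XOR-ing with a PRNG keystream can stand in for a cipher here; this is not secure encryption, just a way to make the bytes look random to a compressor:

    // Toy "encryption" (NOT real crypto): XOR each byte with a keystream
    // derived from the key. The length is unchanged, but the output looks
    // like noise, so deflate finds nothing to compress afterwards.
    #include <cstdint>
    #include <random>
    #include <vector>

    std::vector<unsigned char> toy_encrypt(std::vector<unsigned char> data,
                                           std::uint32_t key) {
        std::mt19937 keystream(key);
        for (auto& b : data) b ^= static_cast<unsigned char>(keystream() & 0xFF);
        return data;
    }
    // gzip(toy_encrypt(text)) is about the size of text plus overhead, while
    // toy_encrypt(gzip(text)) is as small as gzip alone can make it.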


Note 1: That's all C++11/Boost lingo. mt19937 is an instance of the Mersenne twister pseudo-random number generator (prng), which has a period of 2^19937 - 1.

Note 2: The state of the Mersenne twister is actually 624 words (19968 bits), but most programs use somewhat fewer bits to seed it. Perhaps you used a 64-bit integer instead of a 32-bit integer, but it doesn't change the answer by much.

answered by rici


If the data is truly random, no compression algorithm can compress it on average. But if the data has some predictable patterns (e.g., if the probability of a symbol depends on the previous k symbols), many prediction-based compression algorithms will succeed.
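One way to see whether such a pattern exists is to estimate the empirical conditional entropy H(X_n | X_{n-1}) of the byte stream; the helper below is a hypothetical order-1 sketch:

    // Estimate how predictable each byte is from the previous one. A result
    // well under 8 bits per symbol means a prediction-based compressor has
    // something to exploit; uniform random bytes stay close to 8.
    #include <cmath>
    #include <vector>

    double conditional_entropy(const std::vector<unsigned char>& s) {
        std::vector<long> pairs(256 * 256, 0);   // counts of (prev, next) pairs
        std::vector<long> ctx(256, 0);           // counts of each context byte
        for (std::size_t i = 1; i < s.size(); ++i) {
            ++pairs[s[i - 1] * 256 + s[i]];
            ++ctx[s[i - 1]];
        }
        double h = 0.0;
        double total = double(s.size()) - 1;
        for (int a = 0; a < 256; ++a)
            for (int b = 0; b < 256; ++b)
                if (long c = pairs[a * 256 + b])
                    h -= (c / total) * std::log2(double(c) / ctx[a]);
        return h;   // bits per symbol given one byte of context
    }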

answered by krjampani