So yesterday I asked a question on compressing a sequence of integers (link), and most comments made a similar point: if the order is random (or worse, the data is completely random), then one has to settle for log2(k) bits per value k. I've also read similar replies in other questions on this site. Now, I hope this isn't a silly question: if I take that sequence, serialize it to a file, and then run gzip on that file, I do achieve compression (and depending on the time I allow gzip to run, I might get high compression). Could somebody explain this fact?
Thanks in advance.
In particular, files of random data cannot be consistently compressed by any conceivable lossless data compression algorithm; indeed, this result is used to define the concept of randomness in Kolmogorov complexity. It is provably impossible to create an algorithm that can losslessly compress every possible input.
There are broadly two types of data compression techniques: lossy and lossless. Lossy compression reduces size by discarding insignificant data, while lossless compression reduces size by re-encoding the data so that it can be reconstructed exactly.
Data compression can dramatically decrease the amount of storage a file takes up. For example, in a 2:1 compression ratio, a 20 megabyte (MB) file takes up 10 MB of space. As a result of compression, administrators spend less money and less time on storage.
Common lossless (text) compression techniques include run-length encoding, Huffman coding, and Shannon–Fano coding.
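Run-length encoding is the simplest of the techniques just mentioned: it replaces each run of identical symbols with a (symbol, count) pair. A minimal Python sketch (the function name and pair representation are just illustrative choices):

```python
def rle_encode(data: bytes) -> list:
    """Minimal run-length encoder: returns (byte value, run length) pairs."""
    runs = []
    for b in data:
        if runs and runs[-1][0] == b:
            # extend the current run
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            # start a new run
            runs.append((b, 1))
    return runs

# b"aaabbc" becomes three runs: (97, 3), (98, 2), (99, 1)
print(rle_encode(b"aaabbc"))
```

Note that RLE only helps when runs actually occur; on random data it expands the input, which is exactly the point the answers below make.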
if I take that sequence and serialize it to a file and then I run gzip on this file then I do achieve compression
What is "it"? If you take random bytes (each uniformly distributed in 0..255) and feed them to gzip or any compressor, you may on very rare occasions get a small amount of compression, but most of the time you will get a small amount of expansion.
My guess is that you're achieving compression on your random file because you're not using an optimal serialization technique, but without more details it's impossible to answer your question. Is the compressed file with n numbers in the range [0, k) less than n*log2(k) bits? (That is, n*log256(k) bytes). If so, does gzip manage to do that for all the random files you generate, or just occasionally?
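One way to see how a suboptimal serialization produces "compressible" random data: the sketch below (Python, using the standard-library zlib as a stand-in for gzip; the underlying deflate algorithm is the same, and the parameter choices are illustrative assumptions) serializes the same random values once as decimal text and once as packed bytes.

```python
import math
import random
import zlib

random.seed(42)                    # fixed seed so the demo is reproducible
n, k = 10000, 256
nums = [random.randrange(k) for _ in range(n)]

# Information-theoretic floor: n * log2(k) bits = 10000 bytes here.
floor_bytes = n * math.log2(k) / 8

# Suboptimal serialization: decimal text, one number per line
# (roughly 3.6 bytes per value instead of 1).
text = "\n".join(map(str, nums)).encode()

# Near-optimal serialization: one byte per value (possible since k <= 256).
packed = bytes(nums)

ctext = zlib.compress(text, 9)
cpacked = zlib.compress(packed, 9)

print(len(text), len(ctext))       # the text form compresses substantially...
print(len(packed), len(cpacked))   # ...the packed form stays at (or slightly
                                   # above) its raw size
```

The compressor "succeeds" on the text form only because the serialization wasted space to begin with; neither output gets below the n*log2(k) floor.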
Let me note one thing: suppose you say to me, "I've generated a file of random octets by using a uniform_int_distribution(0, 255) with the mt19937 prng [1]. What's the optimal compression of my file?" Now, my answer could reasonably be: "probably about 80 bits". All I need to reproduce your file is
the value you used to seed the prng, quite possibly a 32-bit integer [2]; and
the length of the file, which probably fits in 48 bits.
And if I can reproduce the file given 80 bits of data, that's the optimal compression. Unfortunately, that's not a general purpose compression strategy. It's highly unlikely that gzip will be able to figure out that you used a particular prng to generate the file, much less that it will be able to reverse-engineer the seed (although these things are, at least in theory, achievable; the Mersenne twister is not a cryptographically secure prng.)
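To make the 80-bit claim concrete, here is a sketch in Python (whose random.Random happens to be a Mersenne twister as well; the seed and length values are arbitrary): the whole file is a pure function of (seed, length), so those two numbers are a complete lossless description of it.

```python
import random

def make_file(seed: int, n: int) -> bytes:
    """Generate n 'random' octets from a Mersenne-twister prng."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))

data = make_file(12345, 100_000)   # a 100 kB file of uniform random bytes

# The file is fully determined by (seed, length): a 32-bit seed plus a
# ~48-bit length is an ~80-bit "compressed" form of the entire file.
assert make_file(12345, 100_000) == data
```

Of course, a general-purpose compressor like gzip has no way to discover this description; it sees only the incompressible-looking bytes.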
For another example, it's generally recommended that text be compressed before being encrypted; the result will be quite a bit shorter than compressing after encryption. But the fact is that encryption adds very little entropy; at most, it adds the number of bits in the encryption key. Nonetheless, the resulting output is difficult to distinguish from random data, and gzip will struggle to compress it (although it often manages to squeeze a few bits out).
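This compress-before-encrypting effect is easy to demonstrate. The sketch below uses a toy XOR stream "cipher" (a stand-in for real encryption, used here only to make the bytes look random; it is NOT cryptographically secure) and zlib in place of gzip:

```python
import random
import zlib

# Highly redundant plaintext.
text = b"the quick brown fox jumps over the lazy dog. " * 500

# Toy stream cipher: XOR with a prng keystream. This is only a stand-in
# to make the output statistically random-looking -- not real crypto.
rng = random.Random(0)
keystream = bytes(rng.randrange(256) for _ in range(len(text)))
cipher = bytes(a ^ b for a, b in zip(text, keystream))

before = zlib.compress(text, 9)    # repetitive text compresses very well
after = zlib.compress(cipher, 9)   # ciphertext looks random; almost none

print(len(text), len(before), len(after))
```

Compressing first and then encrypting the small output is therefore far more effective than the reverse order.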
Note 1: That's all C++11/Boost lingo. mt19937 is an instance of the Mersenne twister pseudo-random number generator (prng), which has a period of 2^19937 - 1.
Note 2: The state of the Mersenne twister is actually 624 words (19968 bits), but most programs use somewhat fewer bits to seed it. Perhaps you used a 64-bit integer instead of a 32-bit integer, but it doesn't change the answer by much.
If the data is truly random, then on average no compression algorithm can compress it. But if the data has some predictable patterns (e.g., if the probability of a symbol depends on the previous k symbols in the data), many prediction-based compression algorithms will succeed.
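A quick sketch of that contrast in Python (the 90% repeat probability and sizes are illustrative assumptions; zlib stands in for gzip): an i.i.d. uniform byte stream is incompressible, while an order-1 source whose symbols tend to repeat compresses easily.

```python
import random
import zlib

rng = random.Random(1)
n = 20000

# i.i.d. uniform bytes: no structure for a compressor to exploit.
iid = bytes(rng.randrange(256) for _ in range(n))

# A simple order-1 (Markov) source: each symbol repeats the previous one
# 90% of the time, so the stream is full of long, predictable runs.
dep = bytearray()
sym = 0
for _ in range(n):
    if rng.random() < 0.1:
        sym = rng.randrange(256)
    dep.append(sym)
dep = bytes(dep)

print(len(zlib.compress(iid, 9)))  # about the input size, or a bit more
print(len(zlib.compress(dep, 9)))  # much smaller: the runs are predictable
```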