
Relation of Entropy to Lossless Compression Rate

From Shannon's Source Coding Theorem we know that the expected length of an optimal lossless encoding is bounded by the entropy of the source string like so:

H(X) <= L < H(X) + 1/N 

where H(X) is the entropy of the source, N is the length of the source string (the block length being encoded), and L is the expected length of the compressed string per source symbol.

This necessarily means that there is a limit to lossless compression.
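As a concrete illustration (the numbers here are just an example, nothing special): for a memoryless binary source that emits a 1 with probability 0.1, encoded in blocks of N = 1000 symbols,

H(X) = -(0.1*log2(0.1) + 0.9*log2(0.9)) ≈ 0.469 bits/symbol
0.469 <= L < 0.469 + 1/1000 = 0.470 bits/symbol

so the best achievable expected size for a 1000-bit block is about 469-470 bits, a compression ratio of roughly 0.47 relative to the raw bits.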

What I'd like to know is:

  • Can we directly relate entropy to some expected compression ratio?

  • Can we use the entropy to find some upper bound for the compression ratio?


1 Answer

Shannon's Theorem is defined in terms of random data and probabilities. Similarly, the entropy of a string is only defined for random strings -- the entropy is a property of the distribution, not of the strings themselves. So, we can restate Shannon's Theorem informally as:

If you randomly select a string from a given probability distribution, then the best average compression ratio we can get for the string is given by the entropy rate of the probability distribution.

Given any particular string, I can easily write a compression algorithm which will compress that string down to 1 bit, but my algorithm will necessarily increase the length of some other strings. My compression algorithm works as follows (a short code sketch of both directions appears after the lists below):

  1. If the input string equals some pre-chosen random string, the output is the 1-bit string "0"
  2. Otherwise, the output is the N+1-bit string of "1" followed by the input string

The corresponding decompression algorithm is:

  1. If the input is "0", the output is our previous pre-chosen random string
  2. Otherwise, the output is everything except for the first input bit
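Here is a minimal sketch of that pair of algorithms in Python, assuming bit strings are represented as plain str objects of '0'/'1' characters; CHOSEN is an arbitrary placeholder for the pre-chosen string:

    # Toy "compressor": one pre-chosen string maps to a single bit, and
    # every other input pays exactly one extra bit.
    CHOSEN = "110100100010110"  # the pre-chosen random string (placeholder)

    def compress(s: str) -> str:
        # Rule 1: the special string becomes the 1-bit string "0".
        if s == CHOSEN:
            return "0"
        # Rule 2: everything else becomes "1" followed by the original N bits.
        return "1" + s

    def decompress(c: str) -> str:
        # Rule 1: "0" decodes back to the pre-chosen string.
        if c == "0":
            return CHOSEN
        # Rule 2: otherwise drop the leading marker bit.
        return c[1:]

    assert decompress(compress(CHOSEN)) == CHOSEN
    assert decompress(compress("0101")) == "0101"

Note the trade: the one chosen string shrinks to a single bit, while every other string grows by exactly one bit.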

The key here is that we can't write down a single algorithm which compresses every string from a given distribution at a high rate on average. There are simply too many strings.
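To make "too many strings" precise, a simple counting argument does the job: there are 2^N bit strings of length N, but only 2^N - 1 bit strings of length strictly less than N (counting the empty string), so no lossless scheme can map every length-N string to a shorter one:

2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1 < 2^N

By the pigeonhole principle, such a scheme would have to send two different inputs to the same output, and decompression would be ambiguous.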

If we have a given probability distribution over strings, we can calculate the entropy rate of the distribution, and then if we randomly pick a string according to that distribution and attempt to compress it using any algorithm, the expected size of the compressed string, measured in bits per source symbol, will never be less than the entropy rate. This is what Shannon's Theorem says.
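As a rough, after-the-fact illustration (p = 0.1 and the use of zlib are arbitrary choices for the demo): the sketch below draws bits from an i.i.d. Bernoulli(p) source, computes the entropy rate, and compares it with what a general-purpose compressor actually achieves. The achieved rate should stay at or above the entropy rate, usually with some gap, since zlib is not tuned to this particular source.

    import math
    import random
    import zlib

    p = 0.1        # probability of a 1 bit (arbitrary choice for the demo)
    n = 1_000_000  # number of source bits; kept a multiple of 8 for packing

    # Entropy rate of the Bernoulli(p) source, in bits per symbol.
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    # Draw a random bit string and pack it into bytes so zlib can compress it.
    bits = [1 if random.random() < p else 0 for _ in range(n)]
    packed = bytearray()
    for i in range(0, n, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)

    compressed = zlib.compress(bytes(packed), 9)  # maximum compression level
    bits_per_symbol = 8 * len(compressed) / n

    print(f"entropy rate : {h:.3f} bits/symbol")
    print(f"zlib level 9 : {bits_per_symbol:.3f} bits/symbol")

For p = 0.1 the entropy rate is about 0.469 bits/symbol, so no compressor can average below roughly 469,000 bits for this million-bit input, however cleverly it is written.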
