From Shannon's Source Coding Theorem we know that the expected length of a losslessly compressed string is bounded below by the entropy of the source, like so:
H(X) <= L < H(X) + 1/N
where H(X) is the per-symbol entropy of the source, N is the length of the source string being encoded, and L is the expected length of the compressed string per source symbol.
This necessarily means that there is a limit to lossless compression.
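As a quick numerical illustration (my own sketch; the Bernoulli(0.1) source and the string length are just assumptions for the example), here is how the bound works out in Python:

    import math

    def bernoulli_entropy(p):
        """Entropy, in bits per symbol, of a binary source emitting 1 with probability p."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    p, N = 0.1, 100                  # assumed source bias and source-string length
    H = bernoulli_entropy(p)         # about 0.469 bits per symbol
    print(f"H(X) = {H:.3f} bits/symbol")
    print(f"bound: {H:.3f} <= L < {H + 1/N:.3f} bits/symbol")

So a 100-symbol string drawn from this source cannot be compressed below roughly 47 bits on average, no matter how clever the code.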
What I'd like to know is:
Can we directly relate entropy to some expected compression ratio?
Can we use the entropy to find some upper bound for the compression ratio?
Shannon's Theorem is stated in terms of random data and probabilities. Likewise, entropy is not a property of any individual, fixed string: it is a property of the distribution the string is drawn from. So we can restate Shannon's Theorem informally as:
If you randomly select a string from a given probability distribution, then the best average compression ratio we can get for the string is given by the entropy rate of the probability distribution.
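To make the link to a compression ratio explicit (this is my own worked example, not part of the theorem's statement): if the raw encoding spends log2(A) bits per symbol for an alphabet of size A, and the entropy rate is H bits per symbol, then, ignoring the small +1/N overhead, the best achievable expected compression ratio is
compressed size / original size >= H / log2(A)
A tiny check in Python, assuming i.i.d. bytes (8 raw bits per symbol) with an entropy of 2 bits per symbol:

    H, raw_bits_per_symbol = 2.0, 8.0         # assumed values for the example
    best_ratio = H / raw_bits_per_symbol      # lower bound on compressed/original size
    print(f"best achievable ratio: {best_ratio:.0%}")   # -> 25%

So such a source cannot, on average, be losslessly compressed below about a quarter of its original size.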
Given any single string, I can easily write a compression algorithm which will compress that string down into 1 bit, but my algorithm will necessarily increase the length of some other strings. My compression algorithm works as follows: if the input is equal to the one string chosen in advance, output the single bit 1; otherwise, output a 0 followed by the input unchanged. The corresponding decompression algorithm is: if the input is the single bit 1, output the chosen string; otherwise, drop the leading 0 and output the rest.
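A minimal Python sketch of that scheme (the favoured string below is an arbitrary choice of mine, and inputs are assumed to be bit-strings):

    # One chosen string shrinks to a single bit; every other string grows by one bit.
    SPECIAL = "1101001010"          # the one string this scheme favours (arbitrary choice)

    def compress(s: str) -> str:
        return "1" if s == SPECIAL else "0" + s

    def decompress(c: str) -> str:
        return SPECIAL if c == "1" else c[1:]

    assert decompress(compress(SPECIAL)) == SPECIAL   # 10 bits -> 1 bit
    assert decompress(compress("0000")) == "0000"     # 4 bits  -> 5 bits

Every string other than the favoured one comes out exactly one bit longer, which is the trade-off described above.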
The key here is that we can't write down a single algorithm which compresses every string from a given distribution at a high rate on average. There are just too many strings: whatever length such an algorithm saves on some of them, it must pay back on others.
If we have a probability distribution over strings, we can calculate the entropy rate of the distribution. If we then randomly pick a string according to that distribution and attempt to compress it using any algorithm, the relative size of the compressed string will, on average, never be less than the entropy rate. That is what Shannon's Theorem says.
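As an illustration (my own sketch; the Bernoulli(0.1) bit source, the sample sizes, and the use of zlib as a stand-in for "any algorithm" are all assumptions of the example), we can draw strings from a known distribution, compress them, and compare the average ratio to the entropy rate:

    import math, random, zlib

    p, n_bits, trials = 0.1, 80_000, 20
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # entropy rate, ~0.469 bits per source bit

    ratios = []
    for _ in range(trials):
        bits = [1 if random.random() < p else 0 for _ in range(n_bits)]
        # Pack 8 source bits per byte so the compressor sees the raw 1-bit-per-symbol encoding.
        raw = bytes(
            sum(b << i for i, b in enumerate(bits[j:j + 8]))
            for j in range(0, n_bits, 8)
        )
        ratios.append(len(zlib.compress(raw, 9)) / len(raw))

    print(f"entropy rate       : {H:.3f} compressed bits per source bit")
    print(f"average zlib ratio : {sum(ratios) / len(ratios):.3f}")

The measured ratio should stay above the entropy rate; a real compressor can approach the bound but, on average, cannot beat it.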