My friend told me it existed but I could never find it, not sure if he was lying, but I'm very interested in how the proof works. (Yes, I'm one of those people who found out about Huffman coding from the Silicon Valley TV show, sorry)
Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, with the lengths of the assigned codes based on the frequencies of the corresponding characters. The most frequent character gets the shortest code and the least frequent character gets the longest code.
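As an illustration of that idea, here is a minimal Python sketch of the standard greedy construction (repeatedly merge the two least frequent subtrees); the function name and the example string are my own choices, not part of the original answer.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code by repeatedly merging the two least frequent subtrees."""
    # Each heap entry is (frequency, tie_breaker, {symbol: code_so_far}).
    # The tie_breaker keeps comparisons away from the dicts when frequencies are equal.
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # edge case: only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # least frequent subtree
        f2, _, c2 = heapq.heappop(heap)     # second least frequent subtree
        # Prefix '0' to one subtree's codes and '1' to the other's, then merge them.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes)   # frequent 'a' gets a 1-bit code, rare 'c' and 'd' get 3-bit codes
```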
Huffman coding is a method of lossless compression. Lossless compression is valuable because it can reduce the amount of information (or, in your computer, memory) needed to communicate exactly the same message: the process is perfectly invertible. Lossy compression, on the other hand, discards information.
However, although it is optimal among methods that encode symbols separately, Huffman coding is not always optimal among all compression methods; it is replaced with arithmetic coding or asymmetric numeral systems when a better compression ratio is required.
Given that the source entropy is H and the average codeword length is L, we can characterise the quality of a code either by its efficiency, η = H/L, or by its redundancy, R = L − H. Clearly, we have η = H/(H + R).
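For concreteness, here is a small Python sketch of those quantities; the source probabilities and codeword lengths below are made-up illustrative values, not from the answer above.

```python
import math

# Hypothetical source with known symbol probabilities and a prefix code for it
# (these numbers are illustrative assumptions, chosen so the code is a Huffman code).
probs   = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {"a": 1,   "b": 2,    "c": 3,     "d": 3}

H = sum(-p * math.log2(p) for p in probs.values())   # source entropy, bits/symbol
L = sum(probs[s] * lengths[s] for s in probs)        # average codeword length
eta = H / L                                          # efficiency
R = L - H                                            # redundancy

print(f"H = {H:.3f}, L = {L:.3f}, efficiency = {eta:.3f}, redundancy = {R:.3f}")
# This source has power-of-two probabilities, so the code is perfect:
# efficiency 1.0 and redundancy 0.0.
```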
It is not the most efficient lossless compression method; arithmetic coding beats it, for a start. Since it is not the most efficient, there is no proof that it is. I believe it is the optimal code when restricted to an integer number of bits per symbol, though; perhaps that is the proof your friend was talking about.
The answer is it is, it isn't, and the question is ill-posed. :-)
Here is a high-level view. Lossless compression algorithms provide a reversible mapping from possible documents to compressed documents. Documents can be viewed as strings of bits, and there are 2^n possible documents of n bits, but also only 2^n possible compressed documents of n bits. Therefore the pigeonhole principle says that for every document that is stored more efficiently, some other possible document must be stored less efficiently.
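To make the counting concrete, here is a small brute-force check of the same idea for 3-bit documents; the choice of n = 3 is just for illustration.

```python
from itertools import product

# There are 2**3 = 8 three-bit documents, but only 1 + 2 + 4 = 7 bit strings
# that are strictly shorter (counting the empty string). So no lossless,
# i.e. injective, compressor can map every 3-bit input to a shorter output.
n = 3
documents = [''.join(bits) for bits in product('01', repeat=n)]
shorter = [''.join(bits) for k in range(n) for bits in product('01', repeat=k)]

print(len(documents), "documents of length", n)    # 8
print(len(shorter), "possible shorter outputs")    # 7
assert len(documents) > len(shorter)               # pigeonhole: some document cannot shrink
```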
So how is compression possible? It is possible because, while all documents are possible, they are not equally likely. A good compression algorithm will store likely documents very efficiently and unlikely ones inefficiently. But then the question is which documents should be stored efficiently. The answer to that is, "It depends." And so the answer to how good a compression algorithm is will also depend.
Suppose that you take the set of random documents made out of a set of symbols that independently appear with different probabilities. Among codes that assign each symbol its own whole number of bits, Huffman coding produces the most efficient one possible.
Now suppose you take the set of sentences that are likely to be written in English. Huffman coding is limited to looking at raw letter frequencies; it makes no use of the fact that certain combinations of letters appear very frequently. Other encodings that can exploit that will now work better, as the sketch below suggests.
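Here is a toy illustration of that point (the repetitive string and the pair-based model are my own, purely illustrative). Entropy puts a floor on the average code length any model-based code can achieve, and on deliberately regular text a model over letter pairs has a much lower floor than a per-letter model.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy, in bits, of the empirical distribution given by counts."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

text = "the" * 8   # caricature of English regularity: "th" and "he" dominate

letters = Counter(text)                                            # t, h, e look equally likely
pairs = Counter(text[i:i + 2] for i in range(0, len(text), 2))     # non-overlapping letter pairs

print(f"{entropy(letters):.3f} bits/letter floor for a per-letter model")   # ~1.585
print(f"{entropy(pairs) / 2:.3f} bits/letter floor for a letter-pair model") # ~0.792
```

A per-letter Huffman code cannot beat the first number, while a code built over pairs (or a longer-context model) can approach the second, which is the sense in which "other encodings work better" here.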
Now suppose you take the set of documents that could be produced by your camera. This looks nothing like text, and different encoding methods will work better.
So there are cases where Huffman is best. Cases where it isn't. And the question is ill-posed since it depends on, "What documents are likely?"