This is a question I ran into in a school setting, but it keeps bugging me, so I decided to ask it here.
In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.
My question is: what is the minimum frequency of the most frequent character that guarantees it will be encoded with a single bit?
Huffman coding is an example of a variable-length encoding: some characters may require only 2 or 3 bits, while others may require 7, 10, or 12 bits.
Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, where the lengths of the assigned codes are based on the frequencies of the corresponding characters. The most frequent character gets the shortest code and the least frequent character gets the longest code.
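To make the construction concrete, here is a minimal Python sketch of the standard greedy algorithm (a hypothetical helper, not part of the original question): it repeatedly merges the two least frequent subtrees with a min-heap and then reads the codewords off the resulting tree.

```python
# Minimal Huffman-construction sketch (hypothetical helper, not from the original
# post). Frequencies are assumed to be a dict mapping symbol -> weight; the weights
# only need to be comparable and addable (counts or probabilities both work).
import heapq
from itertools import count

def huffman_codes(freqs):
    """Return {symbol: bitstring} built by greedily merging the two lightest subtrees."""
    if len(freqs) == 1:                      # degenerate case: one symbol, one-bit code
        return {next(iter(freqs)): "0"}
    tiebreak = count()                       # keeps the heap from comparing tree nodes
    heap = [(w, next(tiebreak), (sym, None, None)) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two least frequent subtrees...
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (None, left, right)))  # ...merged
    codes = {}
    def walk(node, prefix):
        sym, left, right = node
        if sym is not None:
            codes[sym] = prefix
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1, "e": 1}))
# "a" (frequency 5/10 = 0.5) ends up with a 1-bit codeword.
```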
It turns out that the answer is 0.4: if the highest frequency p satisfies p >= 0.4, a 1-bit code for the corresponding character is guaranteed. In other words, p >= 0.4 is a sufficient condition.
It is also true that p >= 1/3 is a necessary condition. That is, there can be examples with 0.4 > p >= 1/3 where the most frequent character still gets a 1-bit code, but that never happens when p < 1/3.
One way to reason about this is to look at how the code tree is constructed, in particular at the frequencies of the last three surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", 1980 (unfortunately behind a paywall).
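As a rough sanity check of those thresholds (not a proof), here are two small distributions, with example numbers chosen for illustration rather than taken from the original answer, assuming the huffman_codes sketch from above is available:

```python
# Two small distributions probing the thresholds (example numbers chosen here, not
# from the original answer), reusing the hypothetical huffman_codes sketch above.
above = {"a": 0.45, "b": 0.25, "c": 0.15, "d": 0.15}   # p = 0.45 >= 0.4
below = {"a": 0.32, "b": 0.23, "c": 0.23, "d": 0.22}   # p = 0.32 < 1/3

print(len(huffman_codes(above)["a"]))  # 1 -- a 1-bit code is guaranteed
print(len(huffman_codes(below)["a"]))  # 2 -- the merges push "a" below the root
```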