 

Huffman encoding

Under what conditions does Huffman encoding make a string not compressible? Is it when all the characters appear with equal frequency/probability? And if so, how can one show this is true?

asked Jul 22 '12 by DillPixel


People also ask

What is meant by Huffman encoding?

Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters; the lengths of the assigned codes are based on the frequencies of the corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.

What is Huffman coding example?

Huffman coding is a lossless data compression algorithm. In this algorithm, a variable-length code is assigned to each distinct input character. The code length is related to how frequently the character is used: the most frequent characters get the smallest codes, and the least frequent characters get the longest codes.

How do you encode with Huffman?

Steps for Huffman encoding: 1. Create a leaf node for each symbol and add all leaf nodes to a Minimum Heap keyed by frequency. 2. Extract the two nodes N1 and N2 with the lowest frequencies from the Minimum Heap. 3. Create a new internal node N3 with frequency equal to the sum of the frequencies of N1 and N2; make N1 the left child and N2 the right child of N3, and add N3 to the Minimum Heap. 4. Repeat steps 2 and 3 until the Minimum Heap has only one node.

How good is Huffman encoding?

Abstract: Huffman coding is known to be optimal, yet its dynamic version may yield smaller compressed files. The best known bound is that dynamic Huffman coding uses at most n more bits than static Huffman coding to encode a message of n characters.


1 Answer

You can calculate a simple zero-order entropy for a sequence of symbols, which will tell you whether you even have a chance of significant compression with just Huffman coding. (I wish stackoverflow had TeX formatting like math.stackexchange.com does. I can't write decent equations here.)

Count how many different symbols you have and call that n, with the symbols numbered 1..n. Compute the probability of each symbol, which is how many times each symbol occurs divided by the length of the sequence, and call that p(k). Then the best you can do with zero-order coding is an average number of bits per symbol equal to: -sum(p(k)log(p(k)),k=1..n)/log(2). Then you can compare the result to log(n)/log(2) which is what the answer would be if all the probabilities were equal (1/n) to see how much the unequal probabilities could buy you. You can also compare the result to, for example, 8, if you are currently storing the symbols as a byte each (in which case n <= 256).
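Here is a minimal sketch of that calculation in Python (the function name and the sample input are just illustrative), comparing the zero-order entropy to the equal-probability bound log(n)/log(2) and to 8 bits per symbol:

```python
import math
from collections import Counter

def zero_order_entropy(data):
    """Average bits per symbol achievable by an ideal zero-order coder."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = b"this is an example of a huffman tree"   # illustrative input
h = zero_order_entropy(text)
n = len(set(text))
print(f"zero-order entropy:      {h:.3f} bits/symbol")
print(f"equal-probability bound: {math.log2(n):.3f} bits/symbol (log2(n), n = {n})")
print(f"current storage:         8.000 bits/symbol (one byte per symbol)")
```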

A Huffman code will have equal to or more bits per symbol than that entropy. You also need to take into account how you will convey the Huffman code to the receiver. You will need some sort of header describing the code, which will take more bits. An arithmetic or range code could get closer to the entropy than the Huffman code, especially for very long sequences.
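As a sanity check on that inequality, here is a rough sketch (my own illustration, not the method from the answer) that builds a Huffman code with Python's heapq and compares its average code length to the entropy; the names and the sample input are made up, and header overhead is ignored:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code over data."""
    counts = Counter(data)
    # Heap entries are (frequency, tie_breaker, {symbol: depth so far});
    # the integer tie_breaker keeps the dicts from ever being compared.
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate input: one distinct symbol
        (_, _, only), = heap
        return {sym: 1 for sym in only}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # every symbol moves one level deeper
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

data = b"this is an example of a huffman tree"   # illustrative input
lengths = huffman_code_lengths(data)
counts = Counter(data)
avg = sum(counts[s] * lengths[s] for s in counts) / len(data)
print(f"Huffman average: {avg:.3f} bits/symbol")  # never below the zero-order entropy
```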

In general, a Huffman code by itself will not produce very satisfying compression. A quick test on the 100M character English text test file enwik8 gives an entropy of about five bits per symbol, as does Huffman coding of the text. Huffman (or arithmetic or range) coding needs to be used in combination with a higher-order model of the input data. These models can be simple string matching, like LZ77 as used in deflate or LZMA, a Burrows-Wheeler transform, or prediction by partial matching. An LZ77 compressor, in this case gzip, gets less than three bits per symbol.
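If you want to reproduce that kind of measurement yourself, Python's standard zlib module implements deflate; a quick sketch (the file path is a placeholder for whatever large text file you have, e.g. enwik8):

```python
import zlib

# Point this at any large text file; "enwik8" is a placeholder path.
with open("enwik8", "rb") as f:
    data = f.read()

compressed = zlib.compress(data, 9)   # deflate at maximum compression
print(f"deflate: {8 * len(compressed) / len(data):.3f} bits/symbol")
```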

I can't resist including a picture of Boltzmann's gravestone, engraved on which is his formula that connects entropy to probability, essentially the formula above.

[Image: Boltzmann's gravestone, engraved with S = k · log W]

answered Oct 04 '22 by Mark Adler