Variations in Huffman encoding codewords

Tags:

huffman-code

I'm trying to solve some Huffman coding problems, but I always get different values for the codewords (the values, not the lengths). For example, if the expected codeword for the character 'c' is 100, my solution produces 101.

Here is an example:

Character    Frequency   codeword    my solution
   A            22          00          10
   B            12          100         010
   C            24          01          11
   D            6           1010        0110
   E            27          11          00
   F            9           1011        0111

Both solutions have the same codeword lengths, and in neither one is any codeword a prefix of another codeword.

Does that make my solution valid? Or are there only two valid solutions: the optimal one, and the one obtained by flipping the bits of the optimal one?
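Both columns of the table can be checked mechanically. A small sketch (not part of the original question) that verifies each code is prefix-free and that the two codes have the same lengths:

```python
# The two codes from the table above.
code1 = {"A": "00", "B": "100", "C": "01", "D": "1010", "E": "11", "F": "1011"}
code2 = {"A": "10", "B": "010", "C": "11", "D": "0110", "E": "00", "F": "0111"}

def is_prefix_free(code):
    # After lexicographic sorting, any prefix relation must appear
    # between adjacent codewords, so checking neighbors suffices.
    words = sorted(code.values())
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(code1), is_prefix_free(code2))   # True True
print(all(len(code1[s]) == len(code2[s]) for s in code1))  # True
```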

asked Mar 23 '23 by a.u.r

1 Answer

There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.

There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.

The deflate format starts with zero for the shortest code and increments upward. Within each code length, codes are assigned in order of the symbol values. So for your code, the canonical Huffman code would be:

A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111

So here the two-bit codes are assigned in the symbol order A, C, E, and similarly the four-bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
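That assignment rule can be sketched in a few lines of Python (an illustration of the deflate-style convention, not deflate's actual implementation):

```python
# Code lengths from the question's table.
lengths = {"A": 2, "B": 3, "C": 2, "D": 4, "E": 2, "F": 4}

# Sort by (length, symbol), count up within a length,
# and shift left when moving to a longer length.
code = 0
prev_len = 0
canonical = {}
for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
    code <<= n - prev_len
    canonical[sym] = f"{code:0{n}b}"
    code += 1
    prev_len = n

print(canonical)
# {'A': '00', 'C': '01', 'E': '10', 'B': '110', 'D': '1110', 'F': '1111'}
```

Because the decompressor can rerun exactly this procedure, transmitting only the six lengths is enough to reconstruct the whole code.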

There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.

There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.

For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D, all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get the lengths A-2, B-2, C-2, D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2×2 + 2×2 + 1×2 + 1×2 = 2×1 + 2×2 + 1×3 + 1×3 = 12, so both codes use 12 bits to represent those symbols that many times.
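That arithmetic is easy to confirm; a quick sketch computing the total cost (frequency × length, summed) for both sets of lengths:

```python
# Frequencies and the two alternative length assignments from the example.
freqs = {"A": 2, "B": 2, "C": 1, "D": 1}
lengths1 = {"A": 2, "B": 2, "C": 2, "D": 2}  # combined A with B
lengths2 = {"A": 1, "B": 2, "C": 3, "D": 3}  # combined C+D with B

def cost(lengths):
    # Total encoded size: sum of frequency * codeword length.
    return sum(freqs[s] * lengths[s] for s in freqs)

print(cost(lengths1), cost(lengths2))  # 12 12
```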

answered May 16 '23 by Mark Adler