Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to Generate 'n' Binary Prefix Codes

A Prefix Code is a set of codes such that no code is a prefix of another code. For example, the following set is a prefix code:

10
11
000
001
0100
0101
0110
0111

With n = 8 members. I think these are usually created with some type of Huffman tree.

My question is: Could you help me create a function that will generate a binary prefix code with 'n' members?

Something like this:

list<int> GenerateBinaryPrefixCodes(int n);

Also, the requirement is that it be "optimal" in the sense that the total sum of bits is minimized.

I would prefer an answer in C/C++/C#/something similar. This isn't really homework, but I tagged it that way because it sounds like it would be a good hw problem.

Thanks!

like image 808
user807566 Avatar asked Sep 06 '11 15:09

user807566


2 Answers

Prefix Codes

As you pointed out, a Prefix Code is one where a given code is not a prefix for any other given code. This is a very general definition. A Huffman encoding is a restricted form of Prefix Code.

A common usage for Huffman coding is to minimize (optimize) the total bit count needed to encode a "message". A "message" is typically a sequence of symbols and it is encoded by mapping each symbol occurrence to a specific prefix code and writing out the prefix code in its place. Any set of prefix codes could be used to do this. But, a Huffman encoding will result in the shortest possible message based on bit count.

For example the ASCII character set could be considered as a mapping of symbols to a set of 8 bit prefix codes. This could even be considered a Huffman encoding provided that the encoded message contained exactly the same number of each possible symbol.

The interesting stuff starts when the message to be encoded contains symbol frequencies that are unequal. At this point one can reduce the total bit length of the message by using prefix codes of different lengths. Use short prefix codes for more frequent symbols and longer prefix codes for less frequent symbols.

From your example there are 8 symbols to encode. Symbols mapped to prefix codes '11' and '10' would be the most frequent symbols in the message. Likewise, symbols mapped to '0111', '0110', '1010' and '0100' would be least frequent. Higher the frequency the shorter the prefix code.

The "trick" in creating a Huffman coding is to build the set of Prefix Codes such that after mapping each symbol in the message to their associated prefix codes the message contains as few bits as possible.

I find it useful to view prefix codes as a binary tree where each leaf node maps to a symbol. For example, the binary tree corresponding to the prefix codes given in your question (01, 11, 000, 001, 0100, 0101, 0110, 0111) would be:

           +-- (11)
        +--+
        |  +-- (10)
        |
        |        +-- (0111)
      --+     +--+
        |     |  +-- (0110)
        |  +--+
        |  |  |  +-- (0101)
        |  |  +--+
        +--+     +-- (0100)
           |
           |  +-- (001)
           +--+
              +-- (000)

To get the values in brackets you just assign a '1' when the top edge is followed or a '0' if the bottom edge is followed.

How to build such a tree?

Start with data structures to represent a binary tree and a list.

The binary tree will contain two types of node. 1) A leaf node representing a symbol and its frequency and 2) an internal node representing the cumulative frequency of all the nodes below it (it also needs two pointers, one for the left branch and one for the right branch).

The list contains an ordered set of nodes from the binary tree. Nodes in the list are ordered based on the frequency value of the node they point to. Lowest frequency nodes occur at the front of the list and increase toward the end of the list. A linked list of pointers to tree nodes might be a useful implementation - but any ordered list structure will do.

The algorithm below employs two lists: a "reference" and a "working" list. As nodes are processed from the "reference" list new nodes are created and inserted into the "working" list such that the "working" list remains ordered by node frequency.

Use these data structures and the following algorithm to create a Huffman encoding.

0. Initialize the "reference" list by creating a leaf node for each symbol
   then add it into this list such that nodes with the lowest frequency 
   occur at the front of the list and those with the highest frequency
   occur at the back (basically a priority queue).

1. Initialize the "working" list to empty.

2. Repeat until "reference" list contains 1 node

   2.1 Set MaxFrequency to the sum of the first 2 node frequencies

   2.1 Repeat until "reference" list is empty
       If ("reference" list contains 1 node) OR
          (sum of the next two nodes frequency > MaxFrequency)
            Move remaining nodes to the "working" list
            Set "reference" list to empty
       Else
          Create a new internal node
          Connect the first "reference" node to the left child
          Connect the second "reference" node to the right child
          Set the new node frequency to the sum of the frequencies of the children
          Insert the new node into the "working" list
          Remove the first and second nodes from the "reference" list

   2.2 Copy the "working" list to the "reference" list
   2.3 Set the "working" list to empty

At the end of this process the single "reference" list item will be the root of a Huffman tree. You can enumerate prefix codes by doing a depth first traversal of the tree. Write out a '0' for every left branch taken and a '1' for every right branch. The code is complete when a leaf is encountered. The symbol at the leaf is encoded by the Huffman code just generated.

What is an optimum encoding

An interesting calculation one can perform is to calculate the "bit weight" of a prefix encoding. The bit weight is the total number of bits needed to represent the set of prefix codes.

Look at your original tree above. The weight of this tree is (2 bits * 2) + (4 bits * 5) + (3 bits * 2) = 30 bits. You used 30 bits to represent 8 prefix values. What is the minimal number of bits you could have used? Think about it, as a tree becomes unbalanced the length of the path to some leaves gets longer - this adds to the weight. For example the worst case for a 4 value prefix tree would be:

                 +-- (1 bit)
               --+                  
                 |  +-- (2 bits)
                 +--+
                    |  +-- (3 bits)
                    +--+
                       +-- (3 bits)

giving a total weight of (1 bit * 1) + (2 bits * 1) + (3 bits * 2) = 9 bits

Balance the tree:

                +-- (2 bits)
             +--+
             |  +-- (2 bits)
           --+  
             |  +-- (2 bits)
             +--+
                +-- (2 bits)

giving a total weight of (2 bits * 4) = 8 bits. Notice that for balanced trees all prefix codes end up having the same number of bits.

Tree bit weight is just the sum of the path lengths to all leaves. You minimize the bit weight by minimizing the total path length - and this is done by balancing the tree.

As you can see, there isn't much value in minimizing any given prefix tree, you just end up with a fixed length symbol encoding. The value comes when you consider the bit weight of the resulting encoded message. Minimizing that leads to Huffman encoding.

How many different encodings are there?

Prefix codes may be generated by traversing a binary tree and emitting a '0' for each lower branch followed and a '1' for each upper branch followed until a leaf is encountered. As in:

             +--+ (1)
             |  
           --+  
             |  +-- (01)
             +--+
                +-- (00)

Alternatively we could "flip" that rule and assign a '1' for each lower branch and a '0' for the upper branches:

             +-- (0)
             |  
           --+  
             |  +-- (10)
             +--+
                +-- (11)

These generate two different sets of prefix codes. Addtitional sets can be generated by going through all the possible 1/0 assignments to branches and then traversing the tree. This will give you 2^n sets. But if you do this, you will find the same prefix codes may be generated but in different order. For example, the previous tree would yield the following sets: {(0, 10, 11), (0, 11, 01), (1, 01, 00), (1, 00, 01)}. Then flip the tree to:

                +-- (??)
             +--+
             |  +-- (??)
           --+
             |
             +-- (?)

and you get: {(11, 10, 0), (10, 11, 0), (01, 00, 1), (00, 01, 1)}. Put them both together for 2^3 = 8 sets. However if you want unique sets disregarding order there are only 2 sets: {(0, 10, 11), (1, 00, 01)}. Go through the same exercise for a balanced tree and there is only ever 1 set. All this leads me to believe that the number of unique encodings is related to the balance structure of the tree used to generate prefix codes. Unfortunately, I don't have an exact formula or calculation worked out. On a hunch I would guess the number would be 2^(number of distinct code lengths - 1). For a balanced tree that is: 2^(1 - 1) = 1; for a tree with two distinct code lengths (as in the example above): 2^(2 - 1) = 2; and for your example: 2^(3 - 1) = 4.

like image 171
NealB Avatar answered Oct 17 '22 00:10

NealB


The requirement that the sum of the number of bits is minimized is equivalent to requiring the codes to be optimal Huffman codes for a string where each symbol occurs once. So simply create a string with n unique characters and produce a Huffman tree for it. The algorithm is outlined on Wikipedia.

like image 25
Aasmund Eldhuset Avatar answered Oct 17 '22 01:10

Aasmund Eldhuset