 

Measuring efficiency of Huffman coding with Python bitstring

I have the following string that I would like to Huffman-encode and store efficiently into a bit array:

>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|

The frequencies of the symbols in sequence are:

>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]

I translate this into a Huffman code dictionary:

>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}

I then used the Python bitstring package to translate the string, character by character, into an instance of the BitArray class, which I call bitArray, which contains bits for each character encoded with its respective Huffman code:

>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
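For reference, the same character-by-character encoding can be sketched without the bitstring dependency, using a plain string of `'0'`/`'1'` characters (with `codeDict` and `sequence` as above):

```python
# Huffman code table and input from the question
codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = 'GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|'

# Concatenate the code for each character to get the encoded bit string
bits = ''.join(codeDict[ch] for ch in sequence)
print(bits)       # same bits as bitArray.bin, without the 0b prefix
print(len(bits))  # 84 bits for the 37-character input
```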

Here is the bit array in bytes:

>>> print bitArray.tobytes()
!I\xac\xf3[\x04\xb0\x1a\xa0Ap

I must use tobytes() instead of the bytes property, as the bit array I generate does not divide evenly into 8-bit segments.
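What tobytes() does here can be approximated in plain Python: pad the bit string with zero bits up to a whole number of bytes, then pack each group of eight bits into one byte (a sketch, reusing `codeDict` and `sequence` from above):

```python
codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = 'GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|'
bits = ''.join(codeDict[ch] for ch in sequence)

# Pad the 84-bit string with zero bits to a whole number of bytes, as tobytes() does
padded = bits + '0' * (-len(bits) % 8)
data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
print(len(data))  # 11 bytes (84 bits rounded up to 88)
```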

When I calculate the storage efficiency of the BitArray representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:

>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

Am I measuring storage efficiency correctly? (If I encode longer input strings, this ratio improves, but it seems to approach an asymptotic limit of around 0.28. I'd like to confirm if this is the right way to measure things.)

Edit

The following two approaches yield different answers:

>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297

>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784

I'm not sure which to believe. But in the process of writing data to storage, I think I would need the byte representation, which makes me inclined towards choosing the first result.
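The gap between the two ratios is the zero padding: tobytes() rounds 84 bits up to 88 (11 bytes), while bitArray.len counts only the 84 encoded bits. A sketch comparing the measurements, assuming the same code table and input as above (sys.getsizeof also counts the Python bytes object's header, which is why it reports a ratio above 1):

```python
import sys

codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
sequence = 'GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|'
bits = ''.join(codeDict[ch] for ch in sequence)
padded = bits + '0' * (-len(bits) % 8)
data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

print(len(data) / len(sequence))            # 11 / 37  ~ 0.2973  (byte-padded)
print(len(bits) / (8 * len(sequence)))      # 84 / 296 ~ 0.2838  (raw bits)
print(sys.getsizeof(data) / len(sequence))  # > 1 on CPython: object overhead
```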

asked Nov 07 '11 by Alex Reynolds

1 Answer

>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973

This implies that the encoded version is 30% longer than the original sequence.

I don't think you want to use getsizeof here. If you care about the size of the Python object, you should use getsizeof(sequence) on the original as well, rather than len; getsizeof includes the object's fixed header overhead, which distorts the ratio for short inputs.

If instead, you want to do what Huffman coding is meant to do, and minimize the binary representation, then you want to use len on both (assuming the sequence is represented as one-byte-per-character).

So, your real ratio is 11 / 37.

I assume you're using Huffman coding as an exercise, as this doesn't seem like a practical way to efficiently store what is just a small fixed alphabet with a termination character. At the least, it would be better to use arithmetic coding, which lets you use base-5 encoding instead of base-2 and is optimal for five possible characters.

Really, in a sequence long enough to be worth compressing, I would assume either the G:A:C:T ratio is known, or a fixed-length 2-bit encoding will be just as efficient (since the ratios approach 1:1:1:1), and you don't really need to encode the termination character.
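The fixed-length 2-bit scheme suggested above can be sketched as follows. Note this ignores N and the terminator, which is an assumption; a real encoder would have to handle them out of band (e.g. a separate list of N positions):

```python
# Hypothetical 2-bit packing for the four bases only (N and '|' not handled)
TWO_BIT = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def pack_2bit(seq):
    """Pack four bases per byte, left to right; a short final chunk is zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        chunk = seq[i:i + 4]
        for ch in chunk:
            b = (b << 2) | TWO_BIT[ch]
        b <<= 2 * (4 - len(chunk))  # pad the last byte if the chunk is short
        out.append(b)
    return bytes(out)

print(len(pack_2bit('GTCAGGACAAGAAAGACAA')))  # 19 bases -> 5 bytes, ratio 0.25 plus padding
```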

answered Oct 13 '22 by agf