Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In simple terms, how is compression commonly implemented?

So I've been thinking lately about how compression might be implemented, and what I've postulated so far is that it might be using a sort of HashTable of 'byte signature' keys with memory location values where that 'byte signature' should be replaced upon expansion of the compressed item in question.

Is this far from the truth?

How is compression typically implemented? No need for a page worth of answer, just in simple terms is fine.

like image 816
dreadwail Avatar asked Jul 05 '09 12:07

dreadwail


1 Answers

Compressing algorithms try to find repeated subsequences to replace them with a shorter representation.

Let's take the 25 byte long string Blah blah blah blah blah! (200 bit) from An Explanation of the Deflate Algorithm for example.

Naive approach

A naive approach would be to encode every character with a code word of the same length. We have 7 different characters and thus need codes with the length of ceil(ld(7)) = 3. Our code words can than look like these:

000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used

Now we can encode our string as follows:

000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 110
B   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   !

That would just need 25·3 bit = 75 bit for the encoded word plus 7·8 bit = 56 bit for the dictionary, thus 131 bit (65.5%)

Or for sequences:

00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used

The encoded word:

01 00    00    00    00    10
B  lah b lah b lah b lah b lah!

Now we just need 6·2 bit = 12 bit for the encoded word and 10·8 bit = 80 bit plus 3·8 bit = 24 bit for the length of each word, thus 116 bit (58.0%).

Huffman code approach

The Huffman code is used to encode more frequent characters/substrings with shorter code than less frequent ones:

5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"

// or for sequences

4 × "lah b"
1 × "B", "lah!"

A possible Huffman code for that is:

0      → "l"
10     → "a"
110    → "h"
1110   → " "
11110  → "b"
111110 → "B"
111111 → "!"

Or for sequences:

0  → "lah b"
10 → "B"
11 → "lah!"

Now our Blah blah blah blah blah! can be encoded to:

111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B      l a  h   _    b     l a  h   _    b     l a  h   _    b     l a  h   _    b     l a  h   !

Or for sequences:

10 0     0     0     0     11
B  lah b lah b lah b lah b lah!

Now out first code just needs 78 bit or 8 bit instead of 25·8 = 200 bit like our initial string has. But we still need to add the dictionary where our characters/sequences are stored. For our per-character example we would need 7 additional bytes (7·8 bit = 56 bit) and our per-sequence example would need again 7 bytes plus 3 bytes for the length of each sequence (thus 59 bit). That would result in:

56 + 78 = 134 bit (67.0%)
59 +  8 =  67 bit (33.5%)

The actual numbers may not be correct. Please feel free to edit/correct it.

like image 116
7 revs, 2 users 92% Avatar answered Oct 04 '22 13:10

7 revs, 2 users 92%