What's the concept behind zip compression?

Tags: compression

What's the concept behind zip compression? I can understand the concept of removing empty space, etc., but presumably something has to be added to say how much/where that free space needs to be added back in during decompression?

What's the basic process for compressing a stream of bytes?

asked Sep 09 '09 by Neil Barnwell


People also ask

Why would you compress a ZIP file?

Zipped (compressed) files take up less storage space and can be transferred to other computers more quickly than uncompressed files. In Windows, you work with zipped files and folders in the same way that you work with uncompressed files and folders.

Does zip compression reduce quality?

A zip file is an archive format featuring lossless compression. This means that your files and folders get smaller in data size when compressed, but lose nothing in terms of quality or data. Your original files will always remain intact.

What compression algorithm does zip use?

There are two main ingredients in zip compression: Lempel–Ziv compression (LZ77) and Huffman coding.

How much space does zip compression save?

File compression reduces the size of a file by as much as 90%, without losing any of the primary data. Compressing a file is also known as zipping. File compression therefore helps the user save a considerable amount of disk space.


2 Answers

A good place to start would be to look up the Huffman compression scheme. The basic idea behind Huffman coding is that in a given file some bytes appear more frequently than others (in a plain-text file many byte values won't appear at all). Rather than spending 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters, and longer sequences to encode the less common ones? (These sequences are determined by building a Huffman tree.)
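To make that concrete, here is a minimal Python sketch (not zip's actual implementation) of the standard greedy construction: repeatedly merge the two least frequent subtrees, then read each symbol's code off the path from the root.

    import heapq
    from collections import Counter

    def huffman_codes(data: bytes) -> dict:
        """Build a Huffman code table: frequent bytes get short bit strings."""
        freq = Counter(data)
        # Heap entries: (frequency, tiebreaker, tree); a tree is either a
        # byte value (leaf) or a (left, right) pair (internal node).
        heap = [(f, i, b) for i, (b, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # two least frequent trees...
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (left, right)))  # ...merged
            counter += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: go left/right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: record the bit string
                codes[node] = prefix or "0"      # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    codes = huffman_codes(b"this is an example of a huffman tree")
    # The most frequent byte (the space) gets the shortest code;
    # rare bytes get longer codes.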

Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency: instead of encoding "they" as a sequence of 4 characters, why not treat it as a single symbol because of its frequency, allowing it to be assigned its own leaf in the Huffman tree? This is more or less the basis of ZIP and other lossless compression formats: they look for common "words" (sequences of bytes) in a file, including sequences of just 1 byte if they're common enough, and use a tree to encode them. The zip file then only needs to include the tree information (a copy of each sequence and how often it appears) so that the tree can be reconstructed and the rest of the file decoded.

Follow up:

To better answer the original question: the idea behind lossless compression is not so much to remove empty space as to remove redundant information.

If you created a database to store music lyrics, you'd find a lot of space being used to store the chorus, which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first occurrence of the chorus lines, and then every time the chorus repeats, use CHORUS as a placeholder. (This is pretty much the idea behind LZW compression: in LZW, each line of the song would have a number shown before it, and if a line repeats later in the song, rather than writing out the whole line, only the number is shown.)
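That placeholder idea can be sketched directly. Below is a toy LZW encoder in Python (a simplification for illustration, not zip's actual DEFLATE algorithm): the dictionary starts with every single byte, grows as new sequences are seen, and repeated sequences collapse to a single code.

    def lzw_encode(data: bytes) -> list:
        """Toy LZW encoder: returns a list of dictionary codes."""
        table = {bytes([i]): i for i in range(256)}  # codes 0-255: single bytes
        current, out = b"", []
        for value in data:
            candidate = current + bytes([value])
            if candidate in table:
                current = candidate              # keep extending the match
            else:
                out.append(table[current])       # emit the longest known match
                table[candidate] = len(table)    # and learn the new sequence
                current = bytes([value])
        if current:
            out.append(table[current])
        return out

    print(lzw_encode(b"la la la la la"))  # the repeats collapse to single codes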

answered Sep 21 '22 by David


The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occurring bytes or sequences of bytes.

For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing each byte as the 8-bit sequence 01000001, shorten it to the 1-bit sequence 0. The whole file can then be represented by 0000000000000000 (sixteen 0s), which is just the byte 0x00 repeated twice.

In other words, for this file (0x41 repeated sixteen times) the bits 01000001 convey no more information than the single bit 0, so we throw away the extraneous bits to obtain a shorter representation.
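As a quick sanity check, here is that example in a few lines of Python (a sketch under the assumption that every byte in the file is 0x41):

    data = b"\x41" * 16                  # sixteen 'A' bytes = 128 bits
    bits = "0" * len(data)               # the 1-bit code 0 for each byte
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    assert packed == b"\x00\x00"         # 16 bytes shrink to 2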

That is the core idea behind compression.

As another example, consider the eight-byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and now repeat it 2048 times (16,384 bytes in total). One way to follow the approach above is to represent each byte using three bits:

000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48

Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times, which shrinks the 16,384-byte file to 6,144 bytes (plus the table above).
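You can verify that packing with a short Python sketch (using the hypothetical 3-bit table above):

    # Map each byte 0x41..0x48 to a 3-bit code, 000 through 111.
    codes = {0x41 + i: format(i, "03b") for i in range(8)}

    pattern = bytes(range(0x41, 0x49))           # 0x41 0x42 ... 0x48
    bits = "".join(codes[b] for b in pattern)    # 24 bits per 8-byte pattern
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    assert packed == bytes([0x05, 0x39, 0x77])   # matches the bytes above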

But an even better approach is to represent the byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times, which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary

0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and the byte pattern 0x00 repeated 256 times, and we have compressed the file from 16,384 bytes to 256 bytes (plus the dictionary).

That, in a nutshell, is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. The idea is simple, but the details (finding those representations) can be quite challenging.

See for example:

  1. Data compression
  2. Run length encoding
  3. Huffman compression
  4. Shannon-Fano coding
  5. LZW

answered Sep 24 '22 by jason