What's the concept behind zip compression?

Tags: compression

What's the concept behind zip compression? I can understand the concept of removing empty space, etc., but presumably something has to be added to say how much/where that free space needs to be added back in during decompression?

What's the basic process for compressing a stream of bytes?

asked Sep 09 '09 by Neil Barnwell


People also ask

Why would you compress a ZIP file?

Zipped (compressed) files take up less storage space and can be transferred to other computers more quickly than uncompressed files. In Windows, you work with zipped files and folders in the same way that you work with uncompressed files and folders.

Does zip compression reduce quality?

A zip file is an archive format featuring lossless compression. This means that your files and folders get smaller in data size when compressed, but lose nothing in terms of quality or data. Your original files will always remain intact.

What compression algorithm does zip use?

There are two main ingredients in zip compression: Lempel–Ziv compression (LZ77) and Huffman coding.

How much space does zip compression save?

File compression reduces the size of a file by as much as 90%, without losing any of the primary data. Compressing a file is also known as zipping. File compression therefore helps the user save a considerable amount of disk space.


2 Answers

A good place to start would be to look up the Huffman compression scheme. The basic idea behind Huffman coding is that in a given file some bytes appear more frequently than others (in a plain-text file many byte values won't appear at all). Rather than spending 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters, and longer sequences to encode the less common ones? (These sequences are determined by building a Huffman tree.)
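To make that concrete, here is a minimal Python sketch (not zip's actual implementation) of the standard greedy construction: repeatedly merge the two least frequent subtrees, then read each symbol's code off the path from the root.

    import heapq
    from collections import Counter

    def huffman_codes(data: bytes) -> dict:
        """Build a Huffman code table: frequent bytes get short bit strings."""
        freq = Counter(data)
        # Heap entries: (frequency, tiebreaker, tree); a tree is either a
        # byte value (leaf) or a (left, right) pair (internal node).
        heap = [(f, i, b) for i, (b, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # two least frequent trees...
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (left, right)))  # ...merged
            counter += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: go left/right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: record the bit string
                codes[node] = prefix or "0"      # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    codes = huffman_codes(b"this is an example of a huffman tree")
    # The most frequent byte (the space) gets the shortest code;
    # rare bytes get longer codes.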

Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency: instead of encoding "they" as a sequence of 4 characters, why not treat it as a single symbol because of its frequency, allowing it to be assigned its own leaf in the Huffman tree? This is more or less the basis of ZIP and other lossless compression formats: they look for common "words" (sequences of bytes) in a file, including sequences of just 1 byte if they're common enough, and use a tree to encode them. The zip file then only needs to include the tree information (a copy of each sequence and how often it appears) so that the tree can be reconstructed and the rest of the file decoded.

Follow up:

To better answer the original question: the idea behind lossless compression is not so much to remove empty space as to remove redundant information.

If you created a database to store music lyrics, you'd find a lot of space being used to store the chorus, which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first occurrence of the chorus lines, and then every time the chorus repeats, use CHORUS as a placeholder. (This is pretty much the idea behind LZW compression: in LZW, each line of the song would have a number shown before it, and if a line repeats later in the song, rather than writing out the whole line, only the number is shown.)
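That placeholder idea can be sketched directly. Below is a toy LZW encoder in Python (a simplification for illustration, not zip's actual DEFLATE algorithm): the dictionary starts with every single byte, grows as new sequences are seen, and repeated sequences collapse to a single code.

    def lzw_encode(data: bytes) -> list:
        """Toy LZW encoder: returns a list of dictionary codes."""
        table = {bytes([i]): i for i in range(256)}  # codes 0-255: single bytes
        current, out = b"", []
        for value in data:
            candidate = current + bytes([value])
            if candidate in table:
                current = candidate              # keep extending the match
            else:
                out.append(table[current])       # emit the longest known match
                table[candidate] = len(table)    # and learn the new sequence
                current = bytes([value])
        if current:
            out.append(table[current])
        return out

    print(lzw_encode(b"la la la la la"))  # the repeats collapse to single codes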

answered Sep 21 '22 by David


The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occurring bytes or sequences of bytes.

For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing each byte as the 8-bit sequence 01000001, shorten it to the 1-bit sequence 0. The whole file can then be represented by 0000000000000000 (sixteen 0s), which is just the byte 0x00 repeated twice.

In other words, for this file (0x41 repeated sixteen times) the bits 01000001 convey no more information than the single bit 0, so we throw away the extraneous bits to obtain a shorter representation.
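As a quick sanity check, here is that example in a few lines of Python (a sketch under the assumption that every byte in the file is 0x41):

    data = b"\x41" * 16                  # sixteen 'A' bytes = 128 bits
    bits = "0" * len(data)               # the 1-bit code 0 for each byte
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    assert packed == b"\x00\x00"         # 16 bytes shrink to 2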

That is the core idea behind compression.

As another example, consider the eight-byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and now repeat it 2048 times (16,384 bytes in total). One way to follow the approach above is to represent each byte using three bits:

000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48

Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times, which shrinks the 16,384-byte file to 6,144 bytes (plus the table above).
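You can verify that packing with a short Python sketch (using the hypothetical 3-bit table above):

    # Map each byte 0x41..0x48 to a 3-bit code, 000 through 111.
    codes = {0x41 + i: format(i, "03b") for i in range(8)}

    pattern = bytes(range(0x41, 0x49))           # 0x41 0x42 ... 0x48
    bits = "".join(codes[b] for b in pattern)    # 24 bits per 8-byte pattern
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    assert packed == bytes([0x05, 0x39, 0x77])   # matches the bytes above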

But an even better approach is to represent the byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times, which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary

0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and the byte pattern 0x00 repeated 256 times, and we have compressed the file from 16,384 bytes to 256 bytes (plus the dictionary).

That, in a nutshell, is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. The idea is simple, but the details (finding those representations) can be quite challenging.

See for example:

  1. Data compression
  2. Run length encoding
  3. Huffman compression
  4. Shannon-Fano coding
  5. LZW

answered Sep 24 '22 by jason