
How to use multiple threads for zlib compression (same input source)

My goal is to compress data from a single source in parallel threads. I have defined jobs, held in a list, and each job carries the data read for it (500 KB to 1 MB per job).

My compressor threads compress each job's block using ZLIB and store the result in that job's outbuf.

Now I want to merge all of this into one output file in the standard ZLIB format.

From the ZLIB RFC (RFC 1950) and from browsing the pigz source, I understand that a ZLIB stream is laid out as follows:

     +---+---+
     |CMF|FLG| (2 bytes)
     +---+---+
     +---+---+---+---+
     |     DICTID    | (4 bytes. Present only when FLG.FDICT is set)
     +---+---+---+---+
     +=====================+
     |...compressed data...| (variable size of data)
     +=====================+
     +---+---+---+---+
     |     ADLER32   |  (4 bytes. Adler-32 checksum of the uncompressed data)
     +---+---+---+---+
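
To make the layout above concrete, here is a minimal Python sketch that splits a complete ZLIB stream into its header fields, raw deflate payload, and trailer. The function name `parse_zlib_stream` and the returned dictionary keys are hypothetical, chosen only for illustration; the field positions follow the diagram.

```python
import struct
import zlib


def parse_zlib_stream(stream: bytes) -> dict:
    """Split a complete ZLIB stream into header fields, payload and trailer.

    Hypothetical helper for illustration; assumes `stream` is one whole
    ZLIB stream held in memory.
    """
    cmf, flg = stream[0], stream[1]
    # The two header bytes, read as a big-endian 16-bit number,
    # must be a multiple of 31.
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("header check failed")
    fdict = bool(flg & 0x20)          # FLG.FDICT is bit 5
    body_start = 6 if fdict else 2    # skip the 4-byte DICTID when present
    (adler,) = struct.unpack(">I", stream[-4:])  # trailer is big-endian
    return {
        "method": cmf & 0x0F,         # 8 means deflate
        "fdict": fdict,
        "deflate_data": stream[body_start:-4],
        "adler32": adler,
    }


payload = b"hello world " * 100
info = parse_zlib_stream(zlib.compress(payload))
```

The `deflate_data` field is a raw deflate stream, so it can be fed to `zlib.decompress(..., -15)` on its own, and `adler32` should match `zlib.adler32(payload)`.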

In my case, there is no dictionary.

So when I combine compressed units, the header of every unit is the same.

Hence, I am doing the following operations:

  1. For the first unit, I am writing the header + compressed data.

  2. For the second unit to the last unit, I am writing only the compressed data (No header and no trailer)

  3. After all the units are done, I use adler32_combine() to merge the checksums of all the jobs' data into one final Adler-32, and then I append it to the end of the output file.
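
The checksum merge in step 3 can be sketched in Python. As far as I know, Python's `zlib` module does not expose `adler32_combine()`, so the helper below (`adler32_combine_py`, a hypothetical name) reimplements the arithmetic directly from the definition of Adler-32. It is an illustration of the idea, not zlib's own code.

```python
import zlib

BASE = 65521  # largest prime below 2**16, the Adler-32 modulus


def adler32_combine_py(adler1: int, adler2: int, len2: int) -> int:
    """Return the Adler-32 of data1 + data2 from the two separate checksums.

    adler1 covers data1, adler2 covers data2, and len2 is len(data2).
    Hypothetical helper written for illustration.
    """
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    # A is 1 plus the sum of all bytes; each checksum counted the 1 once.
    a = (a1 + a2 - 1) % BASE
    # B sums the running A values; every byte of data2 sees A shifted
    # by (a1 - 1) compared with checksumming data2 alone.
    b = (b1 + b2 + len2 * (a1 - 1)) % BASE
    return (b << 16) | a


part1 = b"first job data " * 1000
part2 = b"second job data " * 1000
combined = adler32_combine_py(zlib.adler32(part1), zlib.adler32(part2), len(part2))
```

Folding this over the jobs in order gives the trailer value for the whole input, and `combined` should equal `zlib.adler32(part1 + part2)`.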

But the problem is that I get an error during inflate (decompression) saying the data is invalid at some places.

Has anyone already tried something like this? Any relevant information would be really helpful.

Sandeep asked Jun 12 '15 01:06


1 Answer

You cannot simply concatenate raw deflate data streams. Each deflate stream is self-terminating, and so decompression would end at the end of the first stream.
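
A small Python sketch of that behaviour, assuming two independently finished raw deflate streams (the helper name `raw_deflate` is hypothetical):

```python
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Compress to a complete, terminated raw deflate stream (no zlib wrapper)."""
    co = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    return co.compress(data) + co.flush()  # flush() defaults to Z_FINISH


first = raw_deflate(b"first block " * 50)
second = raw_deflate(b"second block " * 50)

d = zlib.decompressobj(wbits=-15)
out = d.decompress(first + second)
# The decompressor stops at the final-block marker of the first stream:
# `out` holds only the first block's data, `d.eof` is True, and the
# second stream is left untouched in `d.unused_data`.
```

Inside a ZLIB wrapper, the decompressor would then try to read the bytes that follow the first stream as the Adler-32 trailer, so the output cannot validate as one stream.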

You need to look more carefully at the pigz code to see how it merges deflate streams. You can use Z_SYNC_FLUSH to complete the last block and bring it to a byte boundary without ending the deflate stream. Then you can complete the deflate stream and strip off the final empty block marked as the end block. Do this for every stream except the last one, which should terminate normally. You can then concatenate the n-1 unterminated deflate streams followed by the one terminating deflate stream.
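
A minimal Python sketch of the Z_SYNC_FLUSH variant, under stated assumptions: each block is compressed independently as raw deflate, every block but the last ends with a sync flush, and the last ends with Z_FINISH. The names `compress_block` and `parallel_zlib`, the block size, and the fixed header bytes `0x78 0x9C` (32 KB window, default level, no dictionary) are choices made for this illustration. The trailer is computed with a single `zlib.adler32()` pass over the input rather than with adler32_combine(). This is not pigz's implementation.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_block(block: bytes, is_last: bool) -> bytes:
    """Compress one block as raw deflate (wbits=-15, no header or trailer)."""
    co = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    out = co.compress(block)
    # Z_SYNC_FLUSH ends on a byte boundary without marking a final block;
    # only the last block terminates the deflate stream with Z_FINISH.
    out += co.flush(zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH)
    return out


def parallel_zlib(data: bytes, block_size: int = 512 * 1024, workers: int = 4) -> bytes:
    """Build one ZLIB stream from independently compressed blocks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b""]
    last_flags = [i == len(blocks) - 1 for i in range(len(blocks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(compress_block, blocks, last_flags))
    header = b"\x78\x9c"  # CMF=0x78, FLG=0x9C: deflate, 32 KB window, no FDICT
    trailer = struct.pack(">I", zlib.adler32(data))  # big-endian Adler-32
    return header + b"".join(parts) + trailer


sample = bytes(range(256)) * 4000
stream = parallel_zlib(sample, block_size=128 * 1024)
restored = zlib.decompress(stream)
```

Because each block starts with an empty history window, matches never reach back into the previous block, which costs some compression ratio at block boundaries. The result is an ordinary ZLIB stream, so `restored` should equal `sample`.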

Mark Adler answered Oct 19 '22 01:10