How to use multiple threads for zlib compression (same input source)

My goal is to compress data from a single source in parallel threads. I have defined jobs that are kept in a list; each job holds one block of the data that was read (500 KB to 1 MB per job).

My compressor threads will compress each block's data using ZLIB and store the result in the outbuf of the corresponding job.

Now I want to merge all of this and create one output file in the standard ZLIB format.

From the ZLIB RFC, and after browsing the source of pigz, I understand that a ZLIB stream is laid out as follows:

     +---+---+
     |CMF|FLG| (2 bytes)
     +---+---+
     +---+---+---+---+
     |     DICTID    | (4 bytes. Present only when FLG.FDICT is set)
     +---+---+---+---+
     +=====================+
     |...compressed data...| (variable size of data)
     +=====================+
     +---+---+---+---+
     |     ADLER32   | (4 bytes. Adler-32 checksum of the uncompressed data)
     +---+---+---+---+

In my case, there is no dictionary either.
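The layout above can be checked against a real stream. The following is a minimal sketch using Python's standard `zlib` module rather than the C API; `parse_zlib_framing` is a hypothetical helper name, and it assumes the whole stream is already in memory.

```python
import struct
import zlib


def parse_zlib_framing(stream: bytes) -> dict:
    """Split a complete zlib stream into header fields, deflate data and trailer."""
    cmf, flg = stream[0], stream[1]
    # RFC 1950: CMF and FLG, read as a 16-bit big-endian value, must be a multiple of 31.
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("bad zlib header check")
    has_dict = bool(flg & 0x20)          # FLG.FDICT bit
    body_start = 6 if has_dict else 2    # skip the 4-byte DICTID when present
    (adler,) = struct.unpack(">I", stream[-4:])  # trailer is stored big-endian
    return {
        "method": cmf & 0x0F,            # 8 means deflate
        "window_bits": (cmf >> 4) + 8,
        "has_dict": has_dict,
        "deflate_data": stream[body_start:-4],
        "adler32": adler,
    }


payload = b"hello zlib " * 100
info = parse_zlib_framing(zlib.compress(payload))
print(info["method"], info["has_dict"], hex(info["adler32"]))
```

The `deflate_data` slice is a raw deflate stream, so it can be inflated on its own with a negative window size, for example `zlib.decompress(info["deflate_data"], -15)`.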

So when I am combining two compressed units, the header of all the units is the same.

Hence, I am doing the following operations.

For the first unit, I am writing the header + compressed data.
For the second unit through the last unit, I am writing only the compressed data (no header and no trailer).
After all the units are done, I am using adler32_combine() to merge the checksums of all the jobs' output data into one final Adler-32, and then I append it to the end of the output file.
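To illustrate the checksum step: Python's `zlib` module does not expose `adler32_combine()`, so the sketch below is a hand-written equivalent derived from the Adler-32 definition, not zlib's own code. It assumes each checksum was computed over a block's uncompressed bytes, and that the length passed in is the length of the second block.

```python
import zlib
from functools import reduce

BASE = 65521  # Adler-32 modulus: the largest prime below 2**16


def adler32_combine(adler1: int, adler2: int, len2: int) -> int:
    """Adler-32 of (data1 + data2), given each part's Adler-32 and len(data2)."""
    a1, b1 = adler1 & 0xFFFF, (adler1 >> 16) & 0xFFFF
    a2, b2 = adler2 & 0xFFFF, (adler2 >> 16) & 0xFFFF
    # A is 1 plus the byte sum; both parts counted the initial 1, so remove one.
    a = (a1 + a2 - 1) % BASE
    # B for part 2 was accumulated from A = 1 instead of A = a1, so every one
    # of its len2 steps is short by (a1 - 1).
    b = (b1 + b2 + (len2 % BASE) * (a1 - 1)) % BASE
    return (b << 16) | a


blocks = [b"first block " * 1000, b"second" * 500, b"third block"]
per_block = [(zlib.adler32(blk), len(blk)) for blk in blocks]

# Fold the per-block checksums left to right, as the final merge step would.
combined, _ = reduce(
    lambda acc, nxt: (adler32_combine(acc[0], nxt[0], nxt[1]), acc[1] + nxt[1]),
    per_block,
)
print(hex(combined))
```

The combined value should match a single pass over the concatenated input, which is the value a ZLIB decoder verifies against the trailer.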

But the problem is that I get an error when decompressing the result, saying the data is invalid at some places.

Has anyone already tried something like this? Any relevant information would be really helpful.

asked Jun 12 '15 01:06

Sandeep

1 Answer

You cannot simply concatenate raw deflate data streams. Each deflate stream is self-terminating, and so decompression would end at the end of the first stream.

You need to look more carefully at the pigz code for how to merge deflate streams. You can use Z_SYNC_FLUSH to complete the last block and bring it to a byte boundary without ending the deflate stream. Then you can complete the deflate stream and strip off the final empty block marked as the end block, except for the last deflate stream, which should terminate normally. Then you can concatenate the series of n-1 unterminated deflate streams and the one final terminating deflate stream.
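As an illustration only (this is not pigz's code), here is a minimal sketch of the Z_SYNC_FLUSH variant using Python's `zlib` module. The names `deflate_block` and `parallel_zlib` are hypothetical, the header bytes `78 9c` are an assumed choice for a 32 KB window with no dictionary, and the Adler-32 is computed sequentially for brevity instead of being combined from per-block checksums.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def deflate_block(block: bytes, last: bool) -> bytes:
    """Compress one block as raw deflate (wbits=-15: no zlib header or trailer)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    out = comp.compress(block)
    # Z_SYNC_FLUSH ends on a byte boundary without marking a final block;
    # only the last block terminates the deflate stream with Z_FINISH.
    out += comp.flush(zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH)
    return out


def parallel_zlib(blocks: list, workers: int = 4) -> bytes:
    """Build one zlib stream from independently compressed blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    is_last = [i == len(blocks) - 1 for i in range(len(blocks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = list(pool.map(deflate_block, blocks, is_last))  # keeps input order
    adler = 1  # Adler-32 initial value
    for block in blocks:
        adler = zlib.adler32(block, adler)  # checksum of the uncompressed data
    header = b"\x78\x9c"  # CMF=0x78 (deflate, 32 KB window), FLG=0x9c (no dictionary)
    return header + b"".join(pieces) + adler.to_bytes(4, "big")


data = bytes(range(256)) * 4000
chunk = 256 * 1024
blocks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
stream = parallel_zlib(blocks)
print(len(data), "->", len(stream), "bytes")
```

Because every block starts with an empty history window, the ratio may be slightly worse than compressing the input in one pass; the output is still a single stream that an ordinary `zlib.decompress(stream)` accepts.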

answered Oct 19 '22 01:10

Mark Adler


Tags: linux, multithreading, compression, zlib
