How to use multiple threads for zlib compression

I have a large chunk of data (around 2 GB) that needs to be compressed using zlib (deflate()). I am currently reading 500 KB of data at a time, compressing it, and writing it to my output file.

With 1 thread, everything is fine. The data is compressed and I am able to write it and uncompress it back.

With 2 threads, the process hangs in the deflate() call.

Here is an outline of the function that is called by my 2 zlib compression threads.

static z_stream z_str;

zlib_compress(...., bool last, bool first)
{

    if (first)
        deflateInit(&z_str, Z_DEFAULT_COMPRESSION);

    if (last)
        flush = Z_FINISH;
    else
        flush = Z_SYNC_FLUSH;

....
....
    status = deflate(&z_str, flush);
...
...
    if (last)
        deflateEnd(&z_str);

}

As I understand, both the calls are referring to the same zstream while calling deflate(), which is resulting in undesired behaviour.

I tried making z_str a local variable and modified the code accordingly. But when uncompressing, it treats 512 as the total size of the file, which is actually just the first chunk of data.

Any idea how to achieve this?

Sandeep asked Dec 25 '22 19:12


2 Answers

It is possible to have multiple threads compressing data simultaneously, as long as each thread has its own separate z_stream object. Each z_stream object should have deflateInit() called on it, then as many calls to deflate() as necessary, and then deflateEnd() called after all of the uncompressed data has been passed to deflate(). Using this technique, it would be straightforward to e.g. compress two different files at once.

However, I suspect that what you are trying to do is speed up the compression of a single large file, no? In that case, you'll find that it is not possible, at least not in the obvious way. The reason is that the later bytes of a deflated stream depend on the earlier bytes of that stream for their meaning, which means they cannot be generated until all of the earlier bytes have been generated. That rules out generating the second half of the compressed file in parallel with the first half.

What you could do is generate two separate compressed files: one that is the compressed contents of the first half of the uncompressed file, and another that is the compressed contents of the second half. That could be done in parallel, since the two compressed streams would be fully independent of each other. Note that you would then need to write your own routine to uncompress those two files and concatenate the results back into a single uncompressed file, since standard compression/decompression utilities would not be aware of this divide-and-conquer trick.

As pointed out by the original author of zlib (Mark Adler), it is possible to compress a large chunk of data in parallel, as exemplified by pigz. Essentially, for each chunk you need to supply the 32K of uncompressed data preceding that chunk.

==Chunk 1===
       -32K-====Chunk 2=======
                       --32K--====Chunk 3====

Then you can combine the compressed data.

Jeremy Friesner answered Jan 05 '23 09:01


As I understand, both the calls are referring to the same zstream while calling deflate(), which is resulting in undesired behaviour.

What did you expect to happen?

Each thread needs its own z_stream structure to work with. Two threads accessing the same z_stream at the same time makes no sense.

Mark Adler answered Jan 05 '23 10:01