I have a large chunk of data (around 2 GB) that needs to be compressed with zlib (deflate()). I currently read 500 KB at a time, compress it, and write it to my output file.
With one thread, everything is fine: the data is compressed, written out, and can be uncompressed again.
With two threads, the process hangs in the deflate() call.
Here is an outline of the function that is called by my two zlib compression threads.
static z_stream z_str;
zlib_compress(...., bool last, bool first)
{
    if (first)
        deflateInit(&z_str, Z_DEFAULT_COMPRESSION);
    if (last)
        flush = Z_FINISH;
    else
        flush = Z_SYNC_FLUSH;
    ....
    ....
    status = deflate(&z_str, flush);
    ...
    ...
    if (last)
        deflateEnd(&z_str);
}
As I understand, both the calls are referring to the same zstream while calling deflate(), which is resulting in undesired behaviour.
I tried making z_str a local variable and modified the code accordingly. But when uncompressing, it takes 512 as the total size of the file, which is actually just the first chunk of data.
Any idea how to achieve this?
It is possible to have multiple threads compressing data simultaneously, as long as each thread has its own separate z_stream object. Each z_stream object should have deflateInit() called on it, then as many calls to deflate() as necessary, and then deflateEnd() called after all of the uncompressed data has been passed to deflate(). Using this technique, it would be straightforward to e.g. compress two different files at once.
However, I suspect that what you are trying to do is speed up the compression of a single large file, no? In that case, you'll find that it is not possible to do, at least not in the obvious way. The reason is that the later bytes of a deflated stream depend on the earlier bytes of that stream for their meaning, which means that they cannot be generated until after all of the earlier bytes have been generated. That rules out generating the second half of the compressed file in parallel with the first half.
What you could do is generate two separate compressed files; one that is the compressed contents of the first half of the uncompressed file, and the other that is the compressed contents of the second half of the uncompressed file. That could be done in parallel since the two compressed streams would be fully independent of each other. Note that you would then need to write your own routine to uncompress those two files and concatenate the result back into a single uncompressed file again, since standard compression/decompression utilities would not be aware of this divide-and-conquer trick.
As pointed out by one of zlib's authors (Mark Adler), it is possible to compress a large chunk of data in parallel, as exemplified by pigz. Essentially, you need to supply the 32K of uncompressed data preceding each chunk when compressing that chunk.
==Chunk 1===
-32K-====Chunk 2=======
--32K--====Chunk 3====
Then you can combine the compressed data.
As I understand, both the calls are referring to the same zstream while calling deflate(), which is resulting in undesired behaviour.
What did you expect to happen?
Each thread needs its own z_stream structure to work with. Two threads accessing the same z_stream at the same time makes no sense.