Java - Parallelizing Gzip

I was assigned to parallelize gzip in Java 7, and I am not sure whether it is possible.

The assignment is:

  • Parallelize gzip using a given number of threads
  • Each thread takes a 1024 KiB block, using the last 32 KiB of the previous block as a dictionary. There is an option to use no dictionary
  • Read from stdin and write to stdout

What I have tried:

  • I have tried using GZIPOutputStream, but there doesn't seem to be a way to isolate and parallelize the deflate step, nor can I access the deflater to alter the dictionary. I tried extending GZIPOutputStream, but it didn't behave the way I wanted, since I still couldn't isolate the compress/deflate step.
  • I tried using Deflater with wrap enabled and a FilterOutputStream to output the compressed bytes, but I wasn't able to get it to compress properly in gzip format. I made it so that each thread has a compressor that writes to a byte array, which is then written to the OutputStream (structurally like the skeleton below).
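
To make this concrete, here is the skeleton I am working from (all names are mine, and compressBlock() is a stub standing in for the part I can't get right):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PGzipSkeleton {
        static final int BLOCK_SIZE = 1024 * 1024; // 1024 KiB

        public static void main(String[] args) throws Exception {
            int threads = Integer.parseInt(args[0]);
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<byte[]>> pending = new ArrayList<Future<byte[]>>();

            byte[] block = new byte[BLOCK_SIZE];
            int n;
            while ((n = readBlock(System.in, block)) > 0) {
                final byte[] copy = Arrays.copyOf(block, n);
                // Compress each block on a pool thread into its own byte[].
                pending.add(pool.submit(new Callable<byte[]>() {
                    public byte[] call() {
                        return compressBlock(copy); // the part I can't get right
                    }
                }));
            }
            // Write the compressed blocks to stdout in their original order.
            for (Future<byte[]> f : pending) {
                System.out.write(f.get());
            }
            System.out.flush();
            pool.shutdown();
        }

        // Fill buf as far as possible; returns the byte count, 0 at EOF.
        static int readBlock(InputStream in, byte[] buf) throws IOException {
            int off = 0;
            while (off < buf.length) {
                int r = in.read(buf, off, buf.length - off);
                if (r < 0) break;
                off += r;
            }
            return off;
        }

        static byte[] compressBlock(byte[] data) {
            throw new UnsupportedOperationException("per-block compression here");
        }
    }

The dictionary would just be the last 32 KiB of the previous copy, handed to compressBlock() along with the block itself.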

I am not sure whether I implemented my approaches incorrectly or took the wrong approach entirely. Can anyone point me in the right direction as to which classes to use for this project?

asked Oct 22 '11 by am3692


4 Answers

A correct parallel implementation which does exactly what you ask is here:

https://github.com/shevek/parallelgzip

answered Nov 16 '22 by Shevek


Yep, gzipping a file with a dictionary cannot be parallelized, as everything depends on everything before it. Maybe your teacher asked you to parallelize the individual gzipping of multiple files in a folder? That would be a great example of parallelizable work, as sketched below.
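
For what it's worth, a minimal sketch of that per-file approach (the class name and the .gz output naming are just illustrative):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.GZIPOutputStream;

    public class GzipFiles {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (final String name : args) {
                pool.submit(new Runnable() {
                    public void run() {
                        // Each file is an independent gzip stream, so the
                        // tasks share no state and parallelize trivially.
                        try (InputStream in = new FileInputStream(name);
                             OutputStream out = new GZIPOutputStream(
                                     new FileOutputStream(name + ".gz"))) {
                            byte[] buf = new byte[8192];
                            int n;
                            while ((n = in.read(buf)) != -1) {
                                out.write(buf, 0, n);
                            }
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }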

answered Nov 16 '22 by Shivan Dragon


To make a process concurrent, you need portions of code that can run concurrently and independently. Most compression algorithms are designed to run sequentially, where every byte depends on the bytes that came before it.

The only way to do compression concurrently is to change the algorithm (making it incompatible with existing approaches).

answered Nov 16 '22 by Peter Lawrey


I think you can do it by inserting appropriate resets in the compression stream. The idea is that the underlying compression engine used in gzip allows the deflater to be reset; the aim is to make it easier to recover from stream corruption, at the cost of a worse compression ratio. After a reset, the deflater is in a known state that is independent of the content compressed so far. You could therefore start from that state in multiple threads (and from many locations in the input data, of course), have each thread produce a compressed chunk, and include the bytes produced by the following reset so that the deflater is taken back to the known state. Then you just have to reassemble the compressed pieces into the overall compressed stream. “Simple!” (Hah!)

I don't know if this will work, and I suspect that the complexity of the whole thing makes it viable only when you're compressing single, very large files. (If you had many files, it would be much easier to just compress each of those in parallel.) Still, that's what I'd try first.

(Also note that the gzip format is just a deflated stream with extra metadata.)
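
A minimal sketch of that idea in Java 7 terms (my own untested code; compressBlock() is just a name I made up): Deflater.deflate() gained a flush argument in Java 7, and FULL_FLUSH flushes all pending output while resetting the compressor to a known state, so chunks deflated independently can be concatenated. With raw (nowrap) deflate you can also prime each block with the previous block's last 32 KiB via setDictionary():

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    // Compress one block as raw deflate (nowrap = true, as used inside gzip),
    // primed with the previous block's tail as a dictionary. Non-final blocks
    // end with a FULL_FLUSH so their output can be concatenated; the final
    // block is finished normally to emit the end-of-stream marker.
    static byte[] compressBlock(byte[] block, byte[] dictionary, boolean last) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        if (dictionary != null) {
            def.setDictionary(dictionary);
        }
        def.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        if (last) {
            def.finish();
            while (!def.finished()) {
                out.write(buf, 0, def.deflate(buf));
            }
        } else {
            int n;
            do {
                // FULL_FLUSH (new in Java 7) consumes the input, flushes
                // the output, and resets the compressor to a known state.
                n = def.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH);
                out.write(buf, 0, n);
            } while (n == buf.length);
        }
        def.end();
        return out.toByteArray();
    }

The main thread would still need to write the fixed gzip header up front and the trailer at the end: the CRC-32 of the whole uncompressed input and its length modulo 2^32, both little-endian. Since java.util.zip doesn't expose a way to combine per-block CRCs, the simplest route is to update a single CRC32 while reading the blocks in.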

answered Nov 16 '22 by Donal Fellows