Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it safe to call ICsharpCode.SharpZipLib in parallel on multiple threads

We are currently using for compression the GZipOutputStream class of ICsharpCode.SharpZipLib library. We do it from a single thread.

I want to split my input data stream into chunks and compress them in parallel. I'm worried though that this library may have some statics inside which will be overwritten from multiple threads and therefore corrupt the resulting stream.

Any thoughts will be appreciated.

like image 635
farfareast Avatar asked Dec 17 '22 14:12

farfareast


1 Answers

This is a really interesting question. Compression is highly CPU intensive, relying on lots of searching and comparisons. So it's very appropriate to want to parallelize it, when you've got multiple CPUs with unimpeded memory access.

There is a class called ParallelDeflateOutputStream within the DotNetZip library that does what you are describing. The class is documented here.

It can be used only for compression - no decompression. Also it is strictly an output stream - you cannot read in order to compress. Considering these constraints, it is basically a DeflateOutputStream, that internally uses multiple threads.

The way it works: It breaks up the incoming stream into chunks, then drops off each chunk into a separate worker thread to be compressed individually. Then it merges all those compressed streams back into one ordered stream at the end.

Suppose the "chunk" size maintained by the stream is N bytes. As the caller invokes Write(), data is buffered into a bucket or chunk. Inside the Stream.Write() method, when the first "bucket" is full, it calls ThreadPool.QueueUserWorkItem, allocating the bucket to the workitem. Subsequent writes into the stream begin filling the next bucket, and when that is full, Stream.Write() calls QUWI again. Each worker thread compresses its bucket, using a "Flush Type" of Sync (see the deflate spec), and then marks its compressed blob ready for output. This various outputs are then re-ordered (because chunk n does not necessarily get compressed before chunk n+1), and written to the captive output stream. As each bucket is written, it is marked empty, ready to be re-filled by the next Stream.Write(). Each chunk must be compressed with the flush type of Sync in order to allow their re-combination via simple concatenation, for the combined bytestream to be a legal DEFLATE Stream. The final chunk needs Flush type = Finish.

The design of this stream means that callers don't need to write with multiple threads. Callers just create the stream as normal, like the vanilla DeflateStream used for output, and write into it. The stream object uses multiple threads, but your code doesn't interface directly with them. The code for a "user" of the ParallelDeflateOutputStream looks like this:

using (FileStream raw = new FileStream(CompressedFile, FileMode.Create))
{
    using (FileStream input = File.OpenRead(FileToCompress))
    {
        using (var compressor = new Ionic.Zlib.ParallelDeflateOutputStream(raw))
        {
            // could tweak params of parallel deflater here
            int n;
            var buffer = new byte[8192];
            while ((n = input.Read(buffer, 0, buffer.Length)) != 0)
            {
                compressor.Write(buffer, 0, n);
            }                    
        }
    }
}

It was designed for use within the DotNetZip ZipFile class, but it is quite usable as a standalone compressing output stream. The resulting stream can be de-DELFATED (inflated?) with any inflater. The result is fully compliant to the spec.

The stream is tweakable. You can set the size of the buffers it uses, and the level of parallelism. It doesn't create buckets without bound, because for large streams (gb scale and so on) that would cause out of memory condiitons. So there's a fixed limit to the number of buckets, and therefore the degree of parallelism, that can be supported.

On my dual-core machine, this stream class nearly doubled the compression speed of large (100mb and larger) files, when compared to the standard DeflateStream. I don't have any larger multi-core machines so I couldn't test it further. The tradeoff is that the parallel implementation uses more CPU and more memory, and also compresses slightly less efficiently (1% less for large files) because of the sync framing I described above. The performance advantage will vary depending on the I/O throughput on your output stream, and whether the storage can keep up with the parallel compressor threads.


Caveat:
It is a DEFLATE stream, not GZIP. For the differences, read RFC 1951 (DEFLATE) and RFC 1952 (GZIP).

But if you really NEED gzip, the source for this stream is available, so you can view it and maybe get some ideas for yourself. GZIP is really just a wrapper on top of DEFLATE, with some additional metadata (like Adler checksum, and so on - see the spec). It seems to me that it would not be very difficult to build a ParallelGzipOutputStream, but it may not be trivial, either.

The trickiest part for me was getting the semantics of Flush() and Close() to work properly.


EDIT

Just for fun, I built a ParallelGZipOutputStream, that basically does what I described above, for GZip. It uses .NET 4.0's Tasks in lieu of QUWI to handle the parallel compression. I tested it just now on a 100mb text file generated via a Markov Chain engine. I compared the results of that class against some other options. Here's what it looks like:

uncompressed: 104857600
running 2 cycles, 6 Flavors

System.IO.Compression.GZipStream:  .NET 2.0 builtin
  compressed: 47550941
  ratio     : 54.65%
  Elapsed   : 19.22s

ICSharpCode.SharpZipLib.GZip.GZipOutputStream:  0.86.0.518
  compressed: 37894303
  ratio     : 63.86%
  Elapsed   : 36.43s

Ionic.Zlib.GZipStream:  DotNetZip v1.9.1.5, CompLevel=Default
  compressed: 37896198
  ratio     : 63.86%
  Elapsed   : 39.12s

Ionic.Zlib.GZipStream:  DotNetZip v1.9.1.5, CompLevel=BestSpeed
  compressed: 47204891
  ratio     : 54.98%
  Elapsed   : 15.19s

Ionic.Exploration.ParallelGZipOutputStream: DotNetZip v1.9.1.5, CompLevel=Default
  compressed: 39524723
  ratio     : 62.31%
  Elapsed   : 20.98s

Ionic.Exploration.ParallelGZipOutputStream:DotNetZip v1.9.1.5, CompLevel=BestSpeed
  compressed: 47937903
  ratio     : 54.28%
  Elapsed   : 9.42s

Conclusions:

  1. The GZipStream that's builtin to .NET is pretty fast. It's also not very efficient, and it is not tunable.

  2. The "BestSpeed" on the vanilla (non-parallelized) GZipStream in DotNetZip is about 20% faster than the .NET builtin stream, and gives about the same compression.

  3. Using multiple Tasks for compression can cut about 45% off the time required on my dual-core laptop (3gb RAM), comparing the vanilla DotNetZip GZipStream to the parallel one. I suppose the time savings would be higher for machines with more cores.

  4. There is a cost to parallel GZIP - the framing increases the size of the compressed file by about 4%. This won't change with the number of cores used.

The resulting .gz file can be decompressed by any GZIP tool.

like image 144
Cheeso Avatar answered Apr 19 '23 06:04

Cheeso