Why would gnu parallel chunking improve gzip's compression size?

Question

File under: "Unexpected Efficiency Dept."

The first 90 million numbers take up about 761MB, as output by:

 seq 90000000

According to man parallel, it can speed up gzip's archiving big files by chopping the input up, and using different CPUs to compress the chunks. So even though gzip is single-threaded this technique makes it multi-threaded:

seq 90000000  | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

Took 46 seconds, on an Intel Core i3-2330M (4) @ 2.2GHz.

Pipe that to plain old gzip:

seq 90000000  | gzip -9 > bigfile2.gz

Took 80 seconds, on the same CPU. Now the surprise:

ls -log bigfile*.gz

Output:

-rw-rw-r-- 1 200016306 Jul  3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul  3 17:30 bigfile2.gz

300K larger? That didn't look right. First I checked with zdiff if the files had the same contents -- yes, the same. I'd have supposed any compressor would do better with a continuous data stream than a chunked one. Why isn't bigfile2.gz smaller than bigfile.gz?

Mark Adler · Accepted Answer

The reason is that for this particular, rather unusual input, smaller deflate blocks are better than larger ones. By default gzip uses larger deflate blocks, since that works best for normal input data. The parallel command is forcing a few smaller deflate blocks by breaking up the input every 1 MB, resulting in a small gain. Though most of the blocks are still the same size.

You can do much better by setting a smaller block size for every block by using zlib's memLevel parameter in deflateInit2(). Here I compress the same output in a single thread each time, using memLevel values from 9 to 2, where a smaller memLevel is a smaller deflate block size (note that zlib does a little better than your gzip at the default level):

9 - 199688429
8 - 198554111 (default)
7 - 191582070
6 - 184880482
5 - 181295029
4 - 180137425 (optimum for this input)
3 - 181176610
2 - 185759115

The optimum memLevel for this data turns out to be 4, for which the compressed data is 12 MB (9%) smaller than for the default memLevel of 8. For memLevel 8, the deflate block size is 16383 symbols, whereas for memLevel 4, the deflate block size is 1023 symbols. One symbol is either a literal byte or a match.

The improvement comes from the extremely regular nature of the input, resulting in a regular sequence of match and literal commands. The smaller the block size, the fewer such distinct commands that appear, which then takes fewer bits to code each of them. This is still true for memLevel 3, but by then the overhead of the code description at the beginning of each deflate block cancels the improvement from fewer distinct codes.

zopfli is a deflate compressor that optimizes the block size and the commands selected, and managed to compress it to 100,656,812 bytes. It took three and a half hours though! zopfli is invoked with pigz using compression level 11.

Why would gnu parallel chunking improve gzip's compression size?

Tags:

linux

shell

gzip

chunking

gnu-parallel

agc

1 Answers

Mark Adler

Recent Activity

Donate For Us

Why would gnu parallel chunking improve gzip's compression size?

Tags:

linux

shell

gzip

chunking

gnu-parallel

agc

1 Answers

Mark Adler

Related questions

Recent Activity

Donate For Us