Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java Deflater strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY

I'm trying to find a balance between performance and degree of compression when gzipping a Java webapp response.

In looking at the Deflater class, I can set a level and a strategy. The levels are self explanatory - BEST_SPEED to BEST_COMPRESSION.

I'm not sure regarding the strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY

I can make some sense from the Javadoc but I was wondering if someone had used a specific strategy in their apps and if you saw any difference in terms of performance / degree of compression.

like image 301
Keyur Avatar asked Mar 26 '10 22:03

Keyur


1 Answers

The strategy options mentioned in the Java Deflater originated in the zlib (C) implementation of ZLIB and (RFC 1950) and DEFLATE (1951), I believe. They are present in virtually all compression libraries that implement DEFLATE.

To understand what they mean, you need to know a little about DEFLATE. The compression algorithm combines LZ77 and Huffman coding. The basics are:

  • LZ77 compression works by finding sequences of data that are repeated. Implementations typically use a "sliding window" of between 1k and 32k, to keep track of data that went before. For any repeats in the original data, instead of inserting the repeated data in the output, the LZ77 compression inserts a "back-reference". Imagine the back reference saying "here, insert the same data you saw 8293 bytes ago, for 17 bytes". The back-ref is encoded as this pair of numbers: a length - in this case 17 - and a distance (or offset) - in this case, 8293.

  • Huffman coding substitutes codes for the actual data. When the data says X, the Huffman code says Y. This obviously helps compression only if the substitute is shorter than the original. (a counter-example is in the Jim Carrey movie Yes Man, when Norm uses "Car" for a shortname for Carl. Carl points out that Carl is already pretty short.) The Huffman encoding algorithm does a frequency analysis, and uses the shortest substitutes for the data sequences that occur most often.


Deflate combines these, so that one can use Huffman codes on LZ77 back-refs. The strategy options on various DEFLATE/ZLIB compressors just tells the library how much to weight Huffman versus LZ77.

  • FILTERED usually means the LZ77 matches are stopped at a length of 5. So when the documentation says

    Use (Filtered) for data produced by a filter (or predictor), ... Filtered data consists mostly of small values with a somewhat random distribution.

(from the zlib man page)
...my reading of the code says that it does LZ77 matching, but only up to sequences of 5 or fewer bytes. That's what the doc means by "small values" I guess. But the number 5 isn't mentioned in the doc, so there's no guarantee that number won't be changed from rev to rev, or from one implementation of ZLIB/DEFLATE to another (like the C version and the Java version).

  • HUFFMAN_ONLY says, only do the substitution codes based on frequency analysis. HUFFMAN_ONLY is very very fast, but not very effective in compression for most data. Unless you have a very small range of byte values (for example, if bytes in your actual datastream take one of only 20 of the possible 255 values), or have extreme requirements for speed in compression at the expense of size, HUFFMAN_ONLY will not be what you want.

  • DEFAULT combines the two in the way the authors expected it would be most effective for most applications.


As far as I know, within DEFLATE there is no way to do only LZ77. There is no LZ77_ONLY strategy. But of course you could build or acquire your own LZ77 encoder and that would be "LZ77 only".


Keep in mind that the strategy never affects correctness of compression; it affects only and operation of it, and the performance of it, either in speed or size.


There are other ways to tweak the compressor. One is to set the size of the LZ77 sliding window. In the C library, this is specified with a "Window bits" option. If you understand LZ77, then you know that a smaller window means less searching back, which means faster compression, at the expense of missing some matches. This is often the more effective knob to turn when compressing.


The bottom line is that, for the 80% case, you don't care to tweak strategy. You might be interested in fiddling with window bits, just to see what happens. But only do that when you've done everything else you need to do in your app.


reference:
How DEFLATE works, by Antaeus Feldspar

like image 137
Cheeso Avatar answered Nov 09 '22 02:11

Cheeso