Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Deflate length of 258 double encoding

In Deflate algorithm there are two ways to encode a length of 258:

  1. Code 284 + 5 extra bits of all 1's

  2. Code 285 + 0 extra bits;

On first glance, this is not optimal, because the proper use of code 285 would allow a length of 259 be encoded;

Is this duality some specification mistake, not fixed because of compatibility reasons, or there are some arguments about it - for example length of 258 must be encoded with shorter code (0 extra bits) because of some reason?

like image 762
johnfound Avatar asked Nov 26 '14 15:11

johnfound


2 Answers

We may never know. The developer of the deflate format, Phil Katz, passed away many years ago at a young age.

My theory is that a match length was limited to 258 so that a match length in the range 3..258 could fit in a byte, encoded as 0..255. This format was developed around 1990, when this might make a difference in an assembler implementation.

like image 184
Mark Adler Avatar answered Nov 06 '22 15:11

Mark Adler


Adding a second answer here to underscore Mark's guess that allowing the length to be encoded in a byte is helpful to assembler implementations. At the time 8086 level assembler was still common and using the 8 bit form of registers gave you more of them to work with than using them in 16 bit size.

The benefit is even more pronounced on 8 bit processors such as the 6502. It starts with the length decoding. Symbols 257 .. 264 represent a match length of 3 .. 10 respectively. If you take the low byte of those symbols (1 .. 8) you get exactly 2 less than the match length.

A more complicated yet fairly easy to compute formula gives 2 less than the match length of symbols 265 through 284. 2 less than the match length of symbol 285 is 256. That doesn't fit in a byte but we can store 0 which turns out to be equivalent.

zlib6502 uses this for considerable advantage. It calculates the match length in inflateCodes_lengthMinus2. And once the back pointer into the window has been determined it copies the data like so:

    jsr copyByte
    jsr copyByte
inflateCodes_copyByte
    jsr copyByte
    dec inflateCodes_lengthMinus2
    bne inflateCodes_copyByte

It makes two explicit calls to copy a byte and then loops over the length less 2. Which works as you would expect for lengths 1 to 255. For length 0 it will actually iterate 256 times as we desire. The first time through the loop the length of 0 is decremented to 255 which is non-zero so the loop continues 255 more times for a total of 256.

I'd have to think that Phil Katz understood intuitively if not explicitly the benefits of keeping the length of matches within 8 bits.

like image 4
George Phillips Avatar answered Nov 06 '22 15:11

George Phillips