Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the maximum size of encoded dynamic Huffman tree as used by DEFLATE (zlib, gzip) format?

Section "3.2.7. Compression with dynamic Huffman codes (BTYPE=10)" of https://www.ietf.org/rfc/rfc1951.txt describes encoding of dynamic Huffman tree used during compression. What is the maximum size (in bits) of such encoded Huffman tree representation as may appear in DEFLATE bitstream? Extra points for backing a particular figure with external reference ;-).

This is a theoretical question to understand DEFLATE properties better. But of course it has practical applications, for example, "How big buffer should be used to guaranteedly decode Huffman tree?"

like image 300
pfalcon Avatar asked Aug 21 '14 16:08

pfalcon


1 Answers

An upper bound on the length of a dynamic block header can be readily computed from the reference you already provided. From RFC 1951, section 3.2.7 we can add up the bits:

3 + 5 + 5 + 4 + 19 * 3 + (286 + 30) * 7 = 2286 bits = 285.75 bytes.

(See calculation notes below for details.)

In practice you will never see one near as long as 286 bytes. More typical lengths are 60 to 90 bytes.

Here is a histogram of dynamic header block lengths from a gzipped source distribution of linux, linux-3.1.6.tar.gz:

dynamic block length histogram

They don't all look the same. Here is another for Archive.pax.gz (an application distribution):

another dynamic block length histogram

The bimodal shape is probably executables vs. text. Executables code all literal byte values, resulting in larger dynamic headers to describe codes for all of those values.


Calculation notes:

I deliberately did not add possible extra bits for symbols 16, 17, or 18, because the use of any of those codes, including their extra bits, would reduce the length of the header, not increase it. A 16 symbol would replace 21 to 42 bits with 9 bits, a 17 symbol would replace 21 to 70 bits with 10 bits, and an 18 symbol would replace 77 to 966 bits with 14 bits (where all symbols are assumed to be seven bits).

There are still 19 initial code lengths even if 16, 17, and 18 are not used, since those are stored first.

I limited the literal/length code lengths to 286 and the distance code lengths to 30, since a compliant inflator will reject values above that.

2286 is the lowest possible upper bound, since there is no constraint in the deflate format that the header be constructed to be optimal. It is possible to construct the code lengths code to, for example, have the lengths 4, 5, 8, and 9 all be 7-bit codes, and then use only those in the list of lengths to construct complete literal/length and distance codes. The code lengths code must also be complete, but that can be achieved by assigning shorter codes to unused lengths.

In short, it is possible to construct an entirely valid dynamic block header that is 2286 bits in length. In fact, here's one (there are many ways to do this):

ed fd 01 e0 38 70 1c 28 a7 fc 7e bf df ef f7 fb
fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e
bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df
ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7
fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd
7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf
df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef
f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb
fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e
bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df
ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7
fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd
7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff f9 7c bf df ef f7 fb fd 7e bf df ef
f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef 23

That is a valid and complete deflate stream represented in hexadecimal. It consists of a single dynamic block, marked as the last block, with a 2286-bit dynamic header and a 9-bit end-of-block code, totaling 2295 bits, rounding up to 287 bytes. It decompresses to zero bytes with no errors.

like image 165
Mark Adler Avatar answered Sep 28 '22 01:09

Mark Adler