
Uncompressed file size using zlib's gzip file access function

Tags: c++, c, gzip, zlib

Using the Linux command-line tool gzip, I can get the uncompressed size of a compressed file with gzip -l.

I couldn't find any function like that in the zlib manual's section "gzip File Access Functions".

I found a solution at http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last 4 bytes of the file, but I am avoiding it for now because I would prefer to use the library's functions.

André Puel asked Feb 09 '12 10:02


1 Answer

There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.

First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
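To make the four-byte trailer concrete, here is a minimal sketch in C that reads it the way described above: seek to four bytes before the end of the file and assemble the bytes in little-endian order. The function name read_gzip_isize is hypothetical and not part of zlib. The value it returns is only the length modulo 2^32 of whatever happens to sit at the end of the file, so every caveat in this answer still applies.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not a zlib function.
 * Reads the last four bytes of the file at `path` and decodes them as a
 * little-endian 32-bit value (the gzip length trailer).
 * Returns 0 on success and stores the value in *isize; returns -1 if the
 * file cannot be opened or is shorter than four bytes. */
int read_gzip_isize(const char *path, uint32_t *isize)
{
    unsigned char b[4];
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;

    /* Position four bytes before the end, then read the trailer. */
    if (fseek(f, -4L, SEEK_END) != 0 || fread(b, 1, 4, f) != 4) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Least significant byte comes first in the file. */
    *isize = (uint32_t)b[0] |
             ((uint32_t)b[1] << 8) |
             ((uint32_t)b[2] << 16) |
             ((uint32_t)b[3] << 24);
    return 0;
}
```

Because the result is a 32-bit value, a file whose uncompressed length is 2^32 + 5 bytes would report 5 here.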

Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)

Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.

So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.

pigz has an option that does in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.

Mark Adler answered Nov 06 '22 18:11