
Get the file size of a very large .gz file on a 64-bit platform

According to the gzip specification, the uncompressed file size is stored in the last 4 bytes of a .gz file.

I have created 2 files with

dd if=/dev/urandom of=500M bs=1024 count=500000
dd if=/dev/urandom of=5G bs=1024 count=5000000

I gzipped them

gzip 500M 5G

I checked the last 4 bytes doing

tail -c4 500M.gz|od -I      (returns 512000000 as expected)
tail -c4 5G.gz|od -I        (returns 825032704, which is not expected)

It seems that once the size crosses the invisible 32-bit barrier, the value written into ISIZE becomes complete nonsense, which is more annoying than if they had used some error bit instead.

Does anyone know of a way to get the uncompressed .gz filesize from the .gz without extracting it?

thanks

specification: http://www.gzip.org/zlib/rfc-gzip.html

edit: if anyone wants to try it out, you could use /dev/zero instead of /dev/urandom

monkeyking asked Dec 27 '09 09:12


1 Answer

There isn't one.

The only way to get the exact uncompressed size of a gzip stream is to actually decompress it (even if you write everything to /dev/null and just count the bytes).
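
A minimal sketch of that counting approach, assuming a POSIX-style shell with gzip, head, wc and mktemp available. The file name sample.bin and the 100000-byte size are made up for illustration; for a real archive only the gzip -dc | wc -c line matters.

```shell
# Build a small throwaway .gz so the sketch is self-contained
# (sample.bin and its size are illustrative, not from the question).
tmpdir=$(mktemp -d)
head -c 100000 /dev/zero > "$tmpdir/sample.bin"
gzip "$tmpdir/sample.bin"            # leaves only sample.bin.gz behind

# Decompress to stdout and count the bytes; nothing is written to disk.
# tr strips the padding some wc implementations put around the number.
size=$(gzip -dc "$tmpdir/sample.bin.gz" | wc -c | tr -d ' ')
echo "$size"

rm -r "$tmpdir"
```

This still costs a full decompression pass, but it needs no extra disk space and works regardless of how large the original was.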

It's worth noting that ISIZE is defined as

ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.

in the gzip RFC, so it isn't actually breaking at the 32-bit barrier; what you're seeing is expected behavior.
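
The two values from the question can be checked against that definition with plain shell arithmetic. This is only a sketch and assumes a shell whose $(( )) arithmetic is 64 bits wide, as is usual on a 64-bit platform; the variable names are illustrative.

```shell
# Sizes written by the two dd commands: bs=1024 times count.
small=$((1024 * 500000))     # the 500M file
large=$((1024 * 5000000))    # the 5G file

# ISIZE is the uncompressed size modulo 2^32.
two32=$((1 << 32))
small_isize=$((small % two32))
large_isize=$((large % two32))

echo "$small_isize"          # 512000000
echo "$large_isize"          # 825032704
```

Both results match what tail and od reported, so for the 5G file the stored value is the true size minus one multiple of 2^32. ISIZE alone cannot tell you how many multiples were dropped.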

Kevin Montrose answered Sep 17 '22 11:09