
Find the size of the file inside a GZIP file

Tags:

java

gzip

Is there a way to find the size of the original file inside a GZIP file in Java?

For example, I have a 15 MB file, a.txt, which has been gzipped to a 3 MB file, a.gz. I want to know the size of the a.txt inside a.gz without unzipping a.gz.

asked Mar 15 '12 at 06:03 by manil



2 Answers

There is no truly reliable way, other than gunzipping the stream. You do not need to save the result of the decompression, so you can determine the size by simply reading and decoding the entire file without taking up space with the decompressed result.

There is an unreliable way to determine the uncompressed size, which is to look at the last four bytes of the gzip file. They hold the uncompressed length of that entry modulo 2^32, in little-endian order.

It is unreliable because a) the uncompressed data may be longer than 2^32 bytes, and b) the gzip file may consist of multiple gzip streams, in which case you would find the length of only the last of those streams.

If you are in control of the source of the gzip files, you know that they consist of single gzip streams, and you know that they are less than 2^32 bytes uncompressed, then and only then can you use those last four bytes with confidence.
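Since the question asks about Java, here is a minimal sketch of the four-byte trailer read described above. The class name GzipTrailerSize and the method readTrailerSize are made up for illustration; only standard JDK classes are used. It assumes a single gzip stream with less than 2^32 bytes of uncompressed data, so it carries the same caveats.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the JDK.
final class GzipTrailerSize {

    /**
     * Returns the value stored in the last four bytes of the file (little endian).
     * For a single-stream gzip file this is the uncompressed length modulo 2^32.
     */
    static long readTrailerSize(Path gzipFile) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(gzipFile.toFile(), "r")) {
            if (raf.length() < 18) { // 10-byte header + 8-byte trailer
                throw new IOException("Too short to be a gzip file: " + gzipFile);
            }
            raf.seek(raf.length() - 4);
            byte[] b = new byte[4];
            raf.readFully(b);
            // Combine the little-endian bytes into an unsigned 32-bit value held in a long.
            return (b[0] & 0xFFL)
                 | (b[1] & 0xFFL) << 8
                 | (b[2] & 0xFFL) << 16
                 | (b[3] & 0xFFL) << 24;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small test file so the example is self-contained.
        Path tmp = Files.createTempFile("sample", ".gz");
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write(new byte[15_000]);
        }
        System.out.println(readTrailerSize(tmp)); // prints 15000
        Files.delete(tmp);
    }
}
```

The result is kept in a long because the stored value is unsigned and would not fit in a Java int once it exceeds 2^31 - 1.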

pigz (which can be found at http://zlib.net/pigz/ ) can do it both ways. pigz -l will give you the unreliable length very quickly. pigz -lt will decode the entire input and give you the reliable lengths.

answered Sep 30 '22 at 22:09 by Mark Adler


Below is one approach to this problem. It is certainly not the best approach, but since Java doesn't provide an API method for this (unlike with Zip files), it's the only way I could think of, apart from the suggestion in one of the comments above to read the last 4 bytes (which assumes the uncompressed file is under 4 GB, because the stored length wraps at 2^32 bytes).

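import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Decompress the whole stream and count the bytes, discarding the data as we go.
long size = 0;
try (GZIPInputStream zis = new GZIPInputStream(new FileInputStream("myFile.gz")))
{
  byte[] buf = new byte[8192];  // allocated once and reused for every read
  int read;
  // read() returns -1 at end of stream; available() is not a dependable EOF test
  while ((read = zis.read(buf)) != -1)
  {
    size += read;
  }
}

System.out.println("File size: " + size + " bytes");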

As you can see, the gzip file is read in and decompressed, and the number of bytes read is totalled, which gives the uncompressed file size.

While this method does work, I cannot recommend it for very large files unless time is not much of a constraint, because the entire file has to be decompressed and that may take several seconds or more.

answered Oct 01 '22 at 00:10 by Crollster