I have a bunch of ridiculously big files (multiple gigabytes) with a really high compression ratio (1:200 or better). I have to process them and would like to show at least some kind of progress estimate. For that I'd like to know the size of the file inside the .gz, so I can compare it with what I've already pulled out.
However, since unpacking the whole file in advance each time is prohibitive and a waste of time, I'd like to figure out the size without doing that.
I know it's possible: I can open gzip files with Total Commander and the viewer plugin shows me the right size. (I know it's not unpacking, because it shows the size immediately, which wouldn't be possible with a 10 GB file inside the gzip.)
There probably are some header fields that contain that information.
However, looking through the docs of various CPAN modules, I couldn't find anything that fits the bill. IO::Uncompress::Gunzip lets me get at a header, but it doesn't contain any file size information.
Any suggestions?
Just so there's a proper answer for this:
sub get_gz_size {
    my ($gz_file) = @_;
    # second line of `gzip --list` output holds: compressed, uncompressed, ratio, name
    my @raw  = `gzip --list "$gz_file"`;
    my $size = ( split " ", $raw[1] )[1];
    return $size;
}
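Note that `gzip --list` gets this number from the same place the next answer does: the ISIZE trailer. A quick cross-language sketch in Python (the sample data and sizes are illustrative) shows that the last 4 bytes of a gzip stream decode to exactly the "uncompressed" column `gzip --list` would print:

```python
import gzip
import struct

# Compress some sample data entirely in memory.
data = b"x" * 100_000
blob = gzip.compress(data)

# The last 4 bytes of a gzip stream are ISIZE: the uncompressed
# length modulo 2**32, stored as a little-endian unsigned int --
# the same number `gzip --list` reports in its "uncompressed" column.
(isize,) = struct.unpack("<I", blob[-4:])
print(isize)  # 100000
```

Because `gzip --list` relies on this field, it is subject to the same mod-2^32 wraparound discussed below for files over 4 GiB.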
As described in the comments above, the last 4 bytes of a gzip stream contain ISIZE: the uncompressed size modulo 2^32, stored as a little-endian unsigned integer.
Here's some code I wrote to calculate the uncompressed bytes given a file path:
use Fcntl qw(SEEK_END);

sub get_isize
{
    my ($file) = @_;
    my $isize_len = 4;

    # open in raw mode so the trailer bytes come through untranslated
    open( my $FH, '<:raw', $file )
        or die "Failed to open $file: $!";

    # ISIZE lives in the last 4 bytes of the gzip stream
    seek( $FH, -$isize_len, SEEK_END )
        or die "Failed to seek $isize_len bytes back from EOF: $!";

    my $bytes_read = read( $FH, my $mod32_isize, $isize_len );
    die "Failed to read $isize_len bytes: $!"
        unless defined $bytes_read && $bytes_read == $isize_len;

    close $FH;

    # ISIZE is a 32-bit little-endian unsigned integer
    return unpack( 'V', $mod32_isize );
}
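For comparison, here is a hypothetical port of the same routine to Python, with a small throwaway gzip file as a demonstration (the file contents are an assumption for the demo):

```python
import gzip
import os
import struct
import tempfile

def get_isize(path):
    """Return ISIZE: the uncompressed length mod 2**32, read from the trailer."""
    with open(path, "rb") as fh:
        fh.seek(-4, os.SEEK_END)              # ISIZE is the last 4 bytes
        return struct.unpack("<I", fh.read(4))[0]

# Demo: write a small gzip file and read its ISIZE back.
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    tmp.write(gzip.compress(b"hello world\n" * 1000))  # 12_000 bytes uncompressed
    name = tmp.name

print(get_isize(name))  # 12000
```

The seek-from-EOF approach means only 4 bytes are ever read, regardless of how large the compressed file is.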
For uncompressed files larger than 4 GiB, you'll need to guess whether to add 4 GiB to the ISIZE retrieved, based upon the expected minimum compression factor.
use constant MIN_COMPRESS_FACTOR => 200;

my $outer_bytes = ( -s $path );
my $inner_bytes = get_isize( $path );
$inner_bytes += 4294967296
    if ( $inner_bytes < $outer_bytes * MIN_COMPRESS_FACTOR );
If your uncompressed file is larger than 2 × 4294967296, then you're going to have to guess how many multiples of 4294967296 to apply (although I've never tested this); you'll need an accurate judge of the expected compression ratio for this to work out:
my $estimated_multiplier = int( ($outer_bytes * MIN_COMPRESS_FACTOR) / 4294967296 );
$inner_bytes += ( 4294967296 * $estimated_multiplier ) if ( $estimated_multiplier );
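To see the correction arithmetic work end to end, here is a worked example in Python. The on-disk size and compression factor are illustrative assumptions; the point is that the trailer value plus the right multiple of 2^32 recovers the true size:

```python
MIN_COMPRESS_FACTOR = 200          # assumed minimum ratio, as in the answer
FOUR_GIB = 2**32

outer_bytes = 50_000_000           # hypothetical on-disk size of the .gz
true_size = outer_bytes * MIN_COMPRESS_FACTOR   # 10_000_000_000 bytes
isize = true_size % FOUR_GIB       # what the gzip trailer actually stores

# Estimate how many times the 32-bit counter wrapped, then undo the wrap.
estimated_multiplier = (outer_bytes * MIN_COMPRESS_FACTOR) // FOUR_GIB
estimate = isize + estimated_multiplier * FOUR_GIB

print(estimate)  # 10000000000
```

Here the trailer stores 1410065408 (10 GB mod 2^32), and two added multiples of 2^32 restore the full 10 GB; a badly misjudged compression factor would put the estimate off by whole multiples of 4 GiB.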