How do I tell in Perl what the size of a file inside a gzip archive is without unpacking the whole file?

Tags: gzip, perl

I have a bunch of ridiculously big files (multiple gigabytes in size) that have a really high compression ratio (1:200 or better). I have to process those and would like to at least show some kind of progress estimate. For that reason I'd like to know the size of the file inside the .gz, so I can compare it with what I have pulled out already.

However, since unpacking the whole file in advance each time is prohibitively slow, I'd like to figure the size out without doing that.

I know it is possible. I can just open gzip files with Total Commander and the viewer plugin will show me the right size. (I know it's not unpacking because it shows me the size immediately, which wouldn't really be possible with a 10GB file inside the gzip.)

There probably are some header fields that contain that information.

However, looking through the docs of various CPAN modules, I couldn't find anything that fits the bill. IO::Uncompress::Gunzip lets me get at a header, but it doesn't contain any file size information.

Any suggestions?

Mithaldu asked Feb 09 '11 15:02



2 Answers

Just so there's a proper answer for this:

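sub get_gz_size {
    my ( $gz_file ) = @_;
    # List-form open bypasses the shell, so odd characters in the file name are safe.
    open( my $gzip, '-|', 'gzip', '--list', '--', $gz_file )
        or die "Cannot run gzip: $!";
    my @raw = <$gzip>;
    close( $gzip ) or die "gzip --list failed for $gz_file";
    # Line 0 is the column header; line 1 holds "compressed uncompressed ratio name".
    # Older gzip versions report the uncompressed size modulo 2**32.
    my $size = ( split " ", $raw[1] )[1];
    return $size;
}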
Mithaldu answered Nov 15 '22 04:11


As described in the comments above, the last 4 bytes of a gzip file contain ISIZE: the size of the uncompressed data modulo 2**32, stored as a little-endian integer.

Here's some code I wrote to calculate the uncompressed bytes given a file path:

sub get_isize
{
   my ($file) = @_;

   my $isize_len = 4;

   # open in raw mode so no I/O layer alters the bytes
   open( my $FH, '<:raw', $file )
      or die "Failed to open $file: $!";

   # seek back from EOF (whence 2 means "relative to end of file")
   seek( $FH, -$isize_len, 2 )
      or die "Failed to seek $isize_len bytes back from EOF: $!";

   # read the ISIZE trailer
   my $bytes_read = read( $FH, my $raw_isize, $isize_len );
   die "Failed to read from $file: $!" unless defined $bytes_read;
   die "Read $bytes_read bytes instead of $isize_len" unless $bytes_read == $isize_len;
   close( $FH );

   # decode the 32-bit little-endian unsigned integer
   return unpack( 'V', $raw_isize );
}
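
To tie this back to the progress estimate asked for in the question, here is a rough sketch of how the trailer value might be used while streaming the content. It builds a small throwaway archive so it can run on its own; the payload, the temp file and the 64 KiB chunk size are assumptions for illustration, and it only holds for single-member archives whose content is under 4 GiB.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a small test archive; in real use $gz_name would be an existing file.
my $payload = "some highly repetitive line\n" x 50_000;
my ( $tmp_fh, $gz_name ) = tempfile( SUFFIX => '.gz', UNLINK => 1 );
close $tmp_fh;
gzip( \$payload => $gz_name ) or die "gzip failed: $GzipError";

# Read the 4-byte ISIZE trailer (uncompressed size modulo 2**32).
open( my $fh, '<:raw', $gz_name ) or die "Failed to open $gz_name: $!";
seek( $fh, -4, SEEK_END ) or die "Failed to seek in $gz_name: $!";
( read( $fh, my $trailer, 4 ) // 0 ) == 4 or die "Short read on $gz_name";
close $fh;
my $total = unpack( 'V', $trailer );
die "Empty archive, nothing to report" unless $total;

# Stream the content and report progress against the expected total.
my $gunzip = IO::Uncompress::Gunzip->new($gz_name)
    or die "gunzip failed: $GunzipError";
my ( $done, $last_percent, $chunk ) = ( 0, -1 );
while ( ( my $got = $gunzip->read( $chunk, 64 * 1024 ) ) > 0 ) {
    $done += $got;    # real code would also process $chunk here
    my $percent = int( 100 * $done / $total );
    next if $percent == $last_percent;
    $last_percent = $percent;
    printf STDERR "\r%3d%%", $percent;
}
print STDERR "\n";
$gunzip->close();
```

Printing only when the percentage changes keeps the terminal output cheap even with small chunks.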

For uncompressed files larger than 4 GiB the ISIZE value wraps around, so you'll need to guess whether to add 4 GiB (4294967296 bytes) to the value retrieved, based upon the expected minimum compression factor.

use constant MIN_COMPRESS_FACTOR => 200;
my $outer_bytes = ( -s $path );
my $inner_bytes = get_isize( $path );
my $bytes = $inner_bytes;
$bytes += 4294967296 if( $inner_bytes < $outer_bytes * MIN_COMPRESS_FACTOR );

If your uncompressed file is larger than 4294967296 * 2 bytes, you're going to have to guess how many multiples of 4294967296 to apply instead (I've never tested this). You'll need an accurate idea of the expected compression ratio for this to work out:

my $estimated_multiplier = int( ( $outer_bytes * MIN_COMPRESS_FACTOR ) / 4294967296 );
my $estimated_bytes = $inner_bytes + ( 4294967296 * $estimated_multiplier );
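
The multiple-of-2**32 guess can also be made so that the result never falls below the lower bound implied by the minimum compression factor. This is a minimal sketch of that variation, not the code above; `smallest_plausible_size` and its parameter names are made up for illustration, and the result is still only a guess that depends on how good the assumed ratio is.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

use constant WRAP => 4294967296;    # 2**32, the modulus applied to ISIZE

# Return the smallest value of the form $isize + k * 2**32 (k >= 0) that is
# at least $compressed_bytes * $min_factor.
sub smallest_plausible_size {
    my ( $isize, $compressed_bytes, $min_factor ) = @_;

    my $lower_bound = $compressed_bytes * $min_factor;
    return $isize if $isize >= $lower_bound;

    # Number of whole wraps needed to reach the lower bound.
    my $wraps = ceil( ( $lower_bound - $isize ) / WRAP );
    return $isize + $wraps * WRAP;
}

# Hypothetical numbers: a 30 MB archive, ISIZE of 1_000_000, minimum ratio 200.
print smallest_plausible_size( 1_000_000, 30_000_000, 200 ), "\n";
```

With those made-up inputs the lower bound is 6000000000 bytes, so one wrap (4295967296) is too small and the function settles on two.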
errant.info answered Nov 15 '22 05:11