Perl reading only specific gz file lines

Tags: gzip, perl, tie

I'm trying to write a script that parses a huge gzip-compressed text file (2 million+ lines). I only want to parse a range of lines in the file. So far I've used zgrep -n to find the two lines that mention the string that I know starts and ends the section of the file I'm interested in.

In my test file I am only interested in reading lines 123080 to 139361. I've found Tie::File, which gives access to the file's lines through the array it ties, but unfortunately this won't work for the gzipped file I'm working with.

Is there something like the following for a gzipped file?

use Tie::File;
tie my @fileLinesArray, 'Tie::File', "hugeFile.txt.gz";
my $startLine = 123080;
my $endLine = 139361;
my $lineCount = $startLine;
while ($lineCount <= $endLine) {
    my $line = $fileLinesArray[$lineCount];
    # blah blah...
    $lineCount++;
}
Anfoni asked Jan 28 '23 01:01


2 Answers

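Use IO::Uncompress::Gunzip, which is a core module:

use IO::Uncompress::Gunzip qw($GunzipError);

my ($start_line, $end_line) = (123080, 139361);

my $z = IO::Uncompress::Gunzip->new('file.gz')
    or die "Cannot open file.gz: $GunzipError";
$z->getline for 1 .. $start_line - 1;    # skip the lines before the range
for ($start_line .. $end_line) {
    my $line = $z->getline;
    last unless defined $line;           # stop early at end of file
    # process $line here
}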

Tie::File gets very slow and memory hungry when processing large files.
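
The same approach can be wrapped in a small helper that checks for errors and stops reading as soon as the last wanted line has been seen. The following is only a minimal sketch: the helper name read_line_range and the sample file sample_range.txt.gz are made up for illustration, and the sample file is created here just so the snippet runs on its own.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return lines $first..$last (1-based, inclusive)
# of a gzip-compressed text file without decompressing it to disk.
sub read_line_range {
    my ($path, $first, $last) = @_;
    my $z = IO::Uncompress::Gunzip->new($path)
        or die "Cannot open $path: $GunzipError\n";
    my @lines;
    my $n = 0;
    while (defined(my $line = $z->getline)) {
        $n++;
        next if $n < $first;       # still before the range
        push @lines, $line;
        last if $n >= $last;       # nothing more to read after the range
    }
    $z->close;
    return @lines;
}

# Build a small sample file (assumption: any gzip file would do here).
my $sample = "sample_range.txt.gz";
my $text   = join '', map { "line $_\n" } 1 .. 10;
gzip(\$text => $sample) or die "gzip failed: $GzipError\n";

my @range = read_line_range($sample, 4, 6);
print @range;
unlink $sample;
```

For a 16,000-line range, collecting the lines in an array is cheap; for much larger ranges it would be better to process each line inside the loop instead of returning a list.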

choroba answered Jan 29 '23 15:01


Tie::File is a poor fit for large files: to reach a given line it has to scan the file and remember the offset of every line before it, which gets slow and memory-hungry. It is also impractical, if not impossible, to use on compressed files. Instead, operate on an input stream of your data and, if you are going to modify the data, write to an output stream that produces a new copy. Perl has good support for gzip compression through the PerlIO::gzip layer (a CPAN module), but you could also pipe the data through one or two gzip processes.

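# I/O stream initialization
use PerlIO::gzip;
open my $input,  "<:gzip", "data.gz"     or die "Cannot read data.gz: $!";
open my $output, ">:gzip", "data.new.gz" or die "Cannot write data.new.gz: $!";    # if $output is needed

# Alternative: I/O stream initialization without PerlIO::gzip
# (-dc decompresses to stdout and leaves data.gz in place)
open my $input,  "-|", "gzip -dc data.gz"      or die "Cannot run gzip: $!";
open my $output, "|-", "gzip -c > data.new.gz" or die "Cannot run gzip: $!";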

Once the input (and optional output) streams are set up, you can use Perl's I/O facilities on them just like any other file handles.

# copy the lines before $startLine unedited (assumes $startLine > 1)
while (<$input>) {
    print $output $_;
    last if $. >= $startLine - 1;
}

# process lines $startLine through $endLine
while (my $line = <$input>) {
    # blah blah blah
    # manipulate $line
    print $output $line;
    last if $. >= $endLine;
}

# write remaining input to output stream, one line at a time
print $output $_ while <$input>;
close $input;
close $output or die "Error closing output: $!";
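
If the section is only known by the text on its first and last lines (the question located them with zgrep -n), the line numbers can be skipped and the markers matched while streaming. This is a rough sketch under assumptions: the marker patterns BEGIN SECTION and END SECTION and the sample file name are invented for illustration, and it uses the core IO::Uncompress::Gunzip module rather than PerlIO::gzip so that it has no CPAN dependency.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical markers; substitute the strings you searched for with zgrep.
my $start_re = qr/BEGIN SECTION/;
my $end_re   = qr/END SECTION/;

# Build a small sample file so the sketch runs on its own.
my $sample = "sample_markers.txt.gz";
my $text   = join "\n", "header", "BEGIN SECTION", "a", "b",
                        "END SECTION", "footer", "";
gzip(\$text => $sample) or die "gzip failed: $GzipError\n";

my $z = IO::Uncompress::Gunzip->new($sample)
    or die "Cannot open $sample: $GunzipError\n";

my @section;
my $inside = 0;
while (defined(my $line = $z->getline)) {
    $inside = 1 if !$inside && $line =~ $start_re;   # first marker seen
    next unless $inside;                             # skip everything before it
    push @section, $line;
    last if $line =~ $end_re;                        # stop at the closing marker
}
$z->close;
unlink $sample;

print @section;
```

This reads the compressed file once and stops at the closing marker, so the lines after the section are never decompressed.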
mob answered Jan 29 '23 15:01