I'm trying to write a script that parses a huge gzip-compressed text file (2 million+ lines). I only want to parse a range of lines in the text file. So far I've used zgrep -n to find the two lines that mention the string that I know will start and end the section of the file I'm interested in.
In my test case file I am interested in only reading in lines 123080 to 139361. I've found Tie::File, which lets me access the file's lines through the array it ties, but unfortunately this won't work for the gzipped file I'm working with.
Is there something like the following for a gzipped file?
use Tie::File;
tie my @fileLinesArray, 'Tie::File', "hugeFile.txt.gz";
my $startLine = 123080;
my $endLine = 139361;
my $lineCount = $startLine;
while ($lineCount <= $endLine) {
    my $line = $fileLinesArray[$lineCount];
    # blah blah...
    $lineCount++;
}
Use IO::Uncompress::Gunzip, which is a core module:
use IO::Uncompress::Gunzip qw($GunzipError);
my $z = IO::Uncompress::Gunzip->new('file.gz')
    or die "Cannot open file.gz: $GunzipError\n";
$z->getline for 1 .. $start_line - 1;   # skip the lines before the range
for ($start_line .. $end_line) {
    my $line = $z->getline;
    ...
}
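To make the skip-then-read idea above easier to reuse, here is a minimal sketch wrapped in a subroutine. The name read_line_range and its arguments are made up for illustration; it assumes 1-based, inclusive line numbers and a single-stream gzip file, and it stops decompressing as soon as the range has been read.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return lines $start..$end (1-based, inclusive)
# from a gzip-compressed text file, reading it as a stream.
sub read_line_range {
    my ($path, $start, $end) = @_;
    my $z = IO::Uncompress::Gunzip->new($path)
        or die "Cannot open $path: $GunzipError\n";
    my @lines;
    my $line_no = 0;
    while (defined(my $line = $z->getline)) {
        $line_no++;
        next if $line_no < $start;   # still before the range
        last if $line_no > $end;     # past the range: stop decompressing
        push @lines, $line;
    }
    $z->close;
    return @lines;
}
```

With the numbers from the question, this would be called as read_line_range('hugeFile.txt.gz', 123080, 139361). The lines before the range still have to be decompressed, since gzip offers no random access, but only the requested range is kept in memory.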
Tie::File gets very slow and memory hungry when processing large files.
Tie::File is a bad idea for large files: to reach a given line it has to scan and index every line before it, and it becomes slow and memory hungry as the file grows. It is also an impractical, if not impossible, idea for compressed files. Instead, you will want to operate on an input stream of your data and, if you are going to modify the data, an output stream to a new copy of the data. Perl has pretty good support for gzip compression through the PerlIO::gzip layer, but you could also pipe data through one or two gzip processes.
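# I/O stream initialization
use PerlIO::gzip;
open my $input, "<:gzip", "data.gz" or die "Cannot read data.gz: $!";
open my $output, ">:gzip", "data.new.gz" or die "Cannot write data.new.gz: $!"; # if $output is needed
# I/O stream initialization without PerlIO::gzip (use instead of the lines above)
# -c makes gzip write to standard output instead of replacing data.gz
open my $input, "gzip -dc data.gz |" or die "Cannot start gzip: $!";
open my $output, "| gzip -c > data.new.gz" or die "Cannot start gzip: $!";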
Once the input (and optional output) streams are set up, you can use Perl's I/O facilities on them just like any other file handles.
# copy the lines before $startLine unedited (assumes $startLine > 1)
while (<$input>) {
    print $output $_;
    last if $. >= $startLine - 1;
}
# process lines $startLine through $endLine
while (my $line = <$input>) {
    # blah blah blah
    # manipulate $line
    print $output $line;
    last if $. >= $endLine;
}
print $output $_ while <$input>; # write remaining input to output stream, line by line
close $input;
close $output;
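If you only need to read the range, as in the question, the output stream can be dropped entirely. Below is a minimal sketch of that read-only case using the pipe approach and $. for line numbers. The name each_line_in_range and the callback argument are hypothetical, and it assumes a Unix-like system with gzip on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback->($line, $line_number) for each
# line in $start..$end (1-based, inclusive) of a gzip-compressed file.
sub each_line_in_range {
    my ($path, $start, $end, $callback) = @_;
    # List-form pipe open: no shell is involved, so $path needs no quoting.
    open my $in, '-|', 'gzip', '-dc', $path
        or die "Cannot start gzip for $path: $!";
    while (my $line = <$in>) {
        next if $. < $start;   # $. is the line number of the last read handle
        last if $. > $end;     # done: stop reading early
        $callback->($line, $.);
    }
    # The return value of close is ignored on purpose: when we stop early,
    # gzip is interrupted and close may report a failure.
    close $in;
    return;
}
```

For example, each_line_in_range('data.gz', $startLine, $endLine, sub { my ($line) = @_; print $line }) would print just the section of interest.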