
Perl6: large gzipped files read line by line

Tags: gzip, raku

I'm trying to read a gz file line by line in Perl6, but I keep getting blocked:

  1. I tried the approach from "How to read gz file line by line in Perl6"; however, that method reads everything into :out and uses far too much RAM to be usable except on very small files.

  2. I don't understand how to use Perl6's Compress::Zlib to read a file line by line; I have opened an issue on their GitHub: https://github.com/retupmoca/P6-Compress-Zlib/issues/17

  3. I'm trying to use Perl5's Compress::Zlib, translating this code, which works perfectly in Perl5:

use Compress::Zlib;
my $file = "data.txt.gz";
my $gz = gzopen($file, "rb") or die "Error reading $file: $gzerrno";

while ($gz->gzreadline($_) > 0) {
    # Process the line read in $_
}

die "Error reading $file: $gzerrno" if $gzerrno != Z_STREAM_END ;
$gz->gzclose() ;

to something like this using Inline::Perl5 in Perl6:

use Compress::Zlib:from<Perl5>;
my $file = 'chrMT.1.vcf.gz';
my $gz = Compress::Zlib::new(gzopen($file, 'r');
while ($gz.gzreadline($_) > 0) {
  print $_;
}
$gz.gzclose();

but I can't see how to translate this :(

  4. I'm confused by the Archive::Libarchive example https://github.com/frithnanth/perl6-Archive-Libarchive/blob/master/examples/readfile.p6 and don't see how to get something like item 3 from it.

  5. There should be something like

for $file.IO.lines(gz) -> $line {

or something similar in Perl6; if it exists, I can't find it.

How can I read a large file line by line without reading everything into RAM in Perl6?

con asked Feb 21 '19 18:02


2 Answers

Update: Now tested, which revealed an error that has since been fixed.

Solution #2

use Compress::Zlib;

my $file   = "data.txt.gz" ;
my $handle = try open $file or die "Error reading $file: $!" ;
my $zwrap  = zwrap($handle, :gzip) ;

for $zwrap.lines {
    .print
}

CATCH { default { die "Error reading $file: $_" } }

$handle.close ;

I've tested this with a small gzipped text file.

I don't know much about gzip etc. but figured this out based on:

  • Knowing P6;

  • Reading Compress::Zlib's README and choosing the zwrap routine;

  • Looking at the module's source code, in particular the signature of the zwrap routine: our sub zwrap ($thing, :$zlib, :$deflate, :$gzip);

  • And trial and error, mainly to guess that I needed to pass the :gzip adverb.


Please comment on whether my code works for you. I'm guessing the main thing is whether it's fast enough for the large files you have.

A failed attempt at solution #5

With solution #2 working I would have expected to be able to write just:

use Compress::Zlib ;
.print for "data.txt.gz".&zwrap(:gzip).lines ;

But that fails with:

No such method 'eof' for invocant of type 'IO::Path'

This is presumably because this module was written before the reorganization of the IO classes.

That led me to @MattOates' issue "IO::Handle like object with .lines ?". It has had no response, and I saw no related repo at https://github.com/MattOates?tab=repositories.

raiph answered Jan 03 '23 21:01


I am focusing on the Inline::Perl5 solution that you tried.

For the call to $gz.gzreadline($_): gzreadline appears to return the line read from the gzip file by modifying its input argument $_ (which it treats as an output argument, even though it is not a true Perl 5 reference variable[1]). The modified value, however, is not passed back to the Perl 6 script.

Here is a possible workaround: create a wrapper module in the current directory, e.g. ./MyZlibWrapper.pm:

package MyZlibWrapper;
use strict;
use warnings;
use Compress::Zlib ();
use Exporter qw(import);

our @EXPORT = qw(gzopen);
our $VERSION = 0.01;

sub gzopen {
    my ( $fn, $mode ) = @_;
    my $gz = Compress::Zlib::gzopen( $fn, $mode );
    my $self = {gz => $gz}; 
    return bless $self, __PACKAGE__;
}

sub gzreadline {
    my ( $self ) = @_;
    my $line = "";
    my $res = $self->{gz}->gzreadline($line);
    return [$res, $line];
}

sub gzclose {
    my ( $self ) = @_;
    $self->{gz}->gzclose();
}    

1;
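
The return convention used by gzreadline in the wrapper (status and line handed back together as a pair) can be checked from plain Perl 5 before Inline::Perl5 is involved. The following is only a minimal sketch: the helper name read_line_pair, the temporary file, and the three sample lines are made up for illustration and are not part of the wrapper module.

```perl
use strict;
use warnings;
use Compress::Zlib;
use File::Temp qw(tempfile);

# Create a small gzipped sample file (hypothetical test data).
my ( $fh, $path ) = tempfile( SUFFIX => '.gz', UNLINK => 1 );
close $fh;
my $out = gzopen( $path, 'wb' ) or die "Cannot write $path: $gzerrno";
$out->gzwrite("first\nsecond\nthird\n");
$out->gzclose();

# Same idea as the wrapper's gzreadline: return the status and the line
# together instead of relying on the aliased output argument.
sub read_line_pair {
    my ($gz) = @_;
    my $line = '';
    my $res  = $gz->gzreadline($line);
    return [ $res, $line ];
}

my $in = gzopen( $path, 'rb' ) or die "Cannot read $path: $gzerrno";
my @lines;
while (1) {
    my ( $res, $line ) = @{ read_line_pair($in) };
    last if $res <= 0;    # 0 means end of file, -1 means error
    push @lines, $line;
}
$in->gzclose();

print scalar(@lines), " lines read\n";
```

Only one line is held in memory at a time here; the @lines array exists solely so the result can be inspected in this small example.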

Then use Inline::Perl5 on this wrapper module instead of Compress::Zlib. For example ./p.p6:

use v6;
use lib:from<Perl5> '.';
use MyZlibWrapper:from<Perl5>;
my $file = 'data.txt.gz';
my $mode = 'rb';
my $gz = gzopen($file, $mode);
loop {
    my ($res, $line) = $gz.gzreadline();
    last if $res <= 0;
    print $line;
}
$gz.gzclose();

[1] In Perl 5 you can modify an input argument that is not a reference, and the change will be reflected in the caller. This is done by modifying entries in the special @_ array variable. For example: sub quote { $_[0] = "'$_[0]'" } $str = "Hello"; quote($str) will quote $str even if $str is not passed by reference.
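
The aliasing behaviour described in footnote [1] can be seen in a few lines of plain Perl 5. This is a minimal sketch; the sub names quote_in_place and quote_copy are invented for illustration.

```perl
use strict;
use warnings;

# The elements of @_ are aliases to the caller's arguments, so assigning
# to $_[0] changes the caller's variable directly.
sub quote_in_place {
    $_[0] = "'$_[0]'";
    return;
}

# Copying @_ into a lexical first breaks the alias: only the copy changes.
sub quote_copy {
    my ($text) = @_;
    $text = "'$text'";
    return $text;
}

my $str = "Hello";
quote_in_place($str);
print "$str\n";    # $str itself now carries the quotes

my $other  = "World";
my $quoted = quote_copy($other);
print "$other $quoted\n";    # $other is unchanged, $quoted is the quoted copy
```

Compress::Zlib's gzreadline relies on the first style, which is why the wrapper above copies the result into a return value instead.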

Håkon Hægland answered Jan 03 '23 20:01