
File::Slurp into a multi-GB scalar - how to split efficiently?

I have a multi-GB file to process in Perl. Reading the file line-by-line takes several minutes; reading it into a scalar via File::Slurp takes a couple of seconds. Good. Now, what is the most efficient way to process each "line" of the scalar? I imagine that I should avoid modifying the scalar, e.g. lopping off each successive line as I process it, to avoid reallocating the scalar.

I tried this:

use File::Slurp;
my $file_ref = read_file( '/tmp/tom_timings/tom_timings_15998', scalar_ref => 1  ) ;

for my $line (split /\n/, $$file_ref) {
    # process line
}

And it's sub-minute: adequate but not great. Is there a faster way to do this? (I have more memory than God.)

Chap, asked Feb 12 '14 17:02


1 Answer

split should be very fast unless you start swapping. The only way I can see to speed it up is to write an XS function that looks for LF rather than use a regex.

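As an aside, you could save a lot of memory by switching to the loop below. Iterating over split builds the complete list of lines before the first iteration runs, whereas this matches one line at a time: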

while ($$file_ref =~ /\G([^\n]*\n|[^\n]+)/g) {
    my $line = $1;  # unlike with split, $line keeps its trailing LF
    # process line
}
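A pure-Perl variation on the same idea, as a rough sketch: index looks for the LF directly instead of going through the regex engine, and substr copies out one line at a time, so the buffer itself is never modified. The helper name each_line is made up for illustration (it is not part of File::Slurp), and whether this beats split or the regex loop on a given perl is something to measure rather than assume.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $cb once per line of $$buf_ref, with the LF
# removed, without modifying the buffer.
sub each_line {
    my ($buf_ref, $cb) = @_;
    my $len = length $$buf_ref;
    my $pos = 0;
    while ($pos < $len) {
        my $nl = index($$buf_ref, "\n", $pos);
        $nl = $len if $nl < 0;    # last line has no trailing LF
        my $line = substr($$buf_ref, $pos, $nl - $pos);
        $cb->($line);
        $pos = $nl + 1;           # resume just past the LF
    }
    return;
}

my $buf = "foo\nbar\n\nbaz";
my @lines;
each_line(\$buf, sub { push @lines, $_[0] });
print "<$_>\n" for @lines;
```

Passing a reference keeps the multi-GB scalar from being copied on the way into the helper; only the current line is ever duplicated.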

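Here is the XS function mentioned above. Each call removes the first line from the buffer and returns it, chomped; sv_chop drops the consumed bytes from the front of the buffer without copying the rest. Move the newSVpvn_flags line after the second if statement (the one that steps past the LF) if you don't want to chomp.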

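SV* next_line(SV* buf_sv) {
    STRLEN buf_len;
    const char* buf = SvPV_force(buf_sv, buf_len);
    const char* next_line_ptr;
    const char* buf_end;
    SV* rv;

    /* Buffer exhausted: signal end of input. */
    if (!buf_len)
        return &PL_sv_undef;

    /* Scan forward to the next LF or the end of the buffer. */
    next_line_ptr = buf;
    buf_end = buf + buf_len;
    while (next_line_ptr != buf_end && *next_line_ptr != '\n')
        ++next_line_ptr;

    /* Copy the line, without its LF, into a new scalar. */
    rv = newSVpvn_flags(buf, next_line_ptr-buf, SvUTF8(buf_sv) ? SVf_UTF8 : 0);

    /* Step past the LF, if there is one. */
    if (next_line_ptr != buf_end)
        ++next_line_ptr;

    /* Drop the consumed bytes from the front of the buffer. */
    sv_chop(buf_sv, next_line_ptr);
    return rv;  /* Typemap will mortalize */
}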

A quick way to test it, using Inline::C:

use strict;
use warnings;

use Inline C => <<'__EOC__';

SV* next_line(SV* buf_sv) {
    ...
}

__EOC__

my $s = <<'__EOI__';
foo
bar
baz
__EOI__

while (defined($_ = next_line($s))) {
   print "<$_>\n";
}
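To check which approach is actually faster on a particular machine, the two pure-Perl loops above can be timed with the core Benchmark module. This is only a sketch: the synthetic buffer is an assumption standing in for the real slurped file, and the loop bodies just count lines, so real per-line work may change the picture.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for the slurped file (assumption; substitute real data).
my $buf = join '', map { "line number $_\n" } 1 .. 200_000;

my %count;

# A negative count makes Benchmark run each sub for at least that many
# CPU seconds, then print a comparison table.
cmpthese(-1, {
    split => sub {
        my $n = 0;
        for my $line (split /\n/, $buf) { ++$n }
        $count{split} = $n;
    },
    regex => sub {
        my $n = 0;
        while ($buf =~ /\G([^\n]*\n|[^\n]+)/g) { ++$n }
        $count{regex} = $n;
    },
});
```

Recording the line counts in %count is a cheap sanity check that both loops see the same number of lines before their timings are compared.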
ikegami, answered Oct 17 '22 10:10