I have a multi-GB file to process in Perl. Reading the file line-by-line takes several minutes; reading it into a scalar via File::Slurp takes a couple of seconds. Good. Now, what is the most efficient way to process each "line" of the scalar? I imagine that I should avoid modifying the scalar, e.g. lopping off each successive line as I process it, to avoid reallocating the scalar.
I tried this:
use File::Slurp;
my $file_ref = read_file( '/tmp/tom_timings/tom_timings_15998', scalar_ref => 1 );
for my $line (split /\n/, $$file_ref) {
    # process line
}
And it's sub-minute: adequate but not great. Is there a faster way to do this? (I have more memory than God.)
split should be very fast unless you start swapping. The only way I can see to speed it up is to write an XS function that looks for LF rather than use a regex.
As an aside, you could save a lot of memory by switching to the following, since split materializes every line at once while this loop extracts one line at a time:
while ($$file_ref =~ /\G([^\n]*\n|[^\n]+)/g) {
    my $line = $1;
    # process line
}
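If you want to put numbers on the difference for your own data, a Benchmark comparison along these lines would do it (a minimal sketch: the synthetic buffer and the sub names are placeholders, not from the original post):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic ~1 MB buffer of short lines (placeholder data).
my $buf = ("x" x 9 . "\n") x 100_000;

cmpthese(-3, {
    # split materializes every line up front.
    split_all => sub {
        my $n = 0;
        $n++ for split /\n/, $buf;
    },
    # \G extracts one line at a time.
    scan_one => sub {
        my $n = 0;
        while ($buf =~ /\G([^\n]*\n|[^\n]+)/g) {
            my $line = $1;
            ++$n;
        }
    },
});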
Said XS function. Move the newSVpvn_flags line after the if statement if you don't want to chomp; the reordered fragment is shown after the function.
SV* next_line(SV* buf_sv) {
    STRLEN buf_len;
    const char* buf = SvPV_force(buf_sv, buf_len);
    const char* next_line_ptr;
    const char* buf_end;
    SV* rv;

    if (!buf_len)
        return &PL_sv_undef;    /* Buffer exhausted. */

    /* Scan forward to the next LF (or the end of the buffer). */
    next_line_ptr = buf;
    buf_end = buf + buf_len;
    while (next_line_ptr != buf_end && *next_line_ptr != '\n')
        ++next_line_ptr;

    /* Copy the line (sans LF), preserving the buffer's UTF-8 flag. */
    rv = newSVpvn_flags(buf, next_line_ptr-buf, SvUTF8(buf_sv) ? SVf_UTF8 : 0);

    /* Skip past the LF, then chop the consumed bytes off the front
       of the buffer by advancing its internal start pointer. */
    if (next_line_ptr != buf_end)
        ++next_line_ptr;

    sv_chop(buf_sv, next_line_ptr);

    return rv; /* Typemap will mortalize */
}
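For reference, the non-chomping reordering described above looks like this:

/* Keep the trailing LF: advance past it first, then copy. */
if (next_line_ptr != buf_end)
    ++next_line_ptr;

rv = newSVpvn_flags(buf, next_line_ptr-buf, SvUTF8(buf_sv) ? SVf_UTF8 : 0);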
Means of testing it:
use strict;
use warnings;

use Inline C => <<'__EOC__';

SV* next_line(SV* buf_sv) {
    ...
}

__EOC__

my $s = <<'__EOI__';
foo
bar
baz
__EOI__

while (defined($_ = next_line($s))) {
    print "<$_>\n";
}
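Hooking it up to the slurped file from the question would then look something like this (a sketch using the question's path; note that sv_chop consumes the buffer in place by advancing the scalar's start offset rather than reallocating, so lopping lines off the front is cheap here):

use File::Slurp;
my $file_ref = read_file('/tmp/tom_timings/tom_timings_15998', scalar_ref => 1);

while (defined(my $line = next_line($$file_ref))) {
    # process $line
}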