perl6 Regex subrules and named regex MUCH MUCH slower than explicit regex; how to make them equally fast?

I have a data file with 1608240 lines. The file is divided into sections. Each section has a unique word in its first line, and every section ends with a line containing the same word, "doneSection".

I am trying to fish out some sections by doing the following (code reformatted by @raiph from original post, to make code easier to interpret):

# using named subrules/regex is EXTREMELY slow;
# it reads about 2 lines per second, and grinds to halt
# after about 500 lines: (>> is the right word boundary)
perl6 -e 'my regex a { [ <{<iron copper carbon>.join("||")}> ] };
          my $x = 0;
          for "/tmp/DataRaw".IO.lines {
            $*ERR.print( "$x 1608240 \r" );
            ++$x;
            .say if m/:i beginSection \s+ <a> >>/ or
                    (m/:i \s+ <a> \s+ /
                     ff
                     m/:i doneSection/);
          }'

# however, if I explicitly write out the regex instead of using a subrule,
# it reads about 1000 lines per second, and it gets the job done:
perl6 -e 'my $x = 0;
          for "/tmp/DataRaw".IO.lines {
            $*ERR.print( "$x 1608240 \r" );
            ++$x;
            .say if m/:i beginSection \s+
                         [ iron || copper || carbon ] >>/ or
                    (m/:i \s+
                         [ iron || copper || carbon ] \s+ /
                     ff
                     m/:i doneSection/);
          }'

My question is, how to make subrule as fast as explicit regex, or at least not grind to a halt? I prefer using higher level of abstraction. Is this a regex engine memory problem? I have also tried using:

my $a=rx/ [ <{ < iron copper carbon > .join("||") }> ] /

and it is equally slow.

I cannot post the 1.6 million lines of my data file, but you can probably generate a similar file for testing purposes.
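For anyone who wants to reproduce this, a file of roughly similar shape could be generated along the lines below. This is only a sketch: the real layout is just partly described above, so the sub name make-sections, the filler lines, the word list and the output path /tmp/DataRaw.sample are all assumptions made for illustration.

```raku
# Hypothetical test-data generator. The line shapes are assumptions,
# not the layout of the real data file.
sub make-sections(Int $sections, Int $body-lines = 10) {
    my @metals = <iron copper carbon zinc gold silver>;
    my @lines;
    for ^$sections -> $i {
        # Cycle through the word list, so words repeat across sections.
        my $metal = @metals[$i % @metals.elems];
        @lines.push: "beginSection $metal";
        @lines.push: "  filler line $_ of section $i" for ^$body-lines;
        @lines.push: "doneSection";
    }
    return @lines;
}

# 1000 sections of 12 lines each; raise the count to approach 1.6 million lines.
spurt "/tmp/DataRaw.sample", make-sections(1000).join("\n") ~ "\n";
```

Writing to a separate path keeps the sketch from overwriting a real /tmp/DataRaw.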

Thanks for any hints.

lisprogtor asked Feb 09 '19 09:02


1 Answer

The problem isn't use of subrules / naming regexes. It's what's inside the regex. It's:

[ <{<iron copper carbon>.join("||")}> ]

vs

[ iron || copper || carbon ]

The following should eliminate the speed difference. Please try it and comment on your results:

my regex a { || < iron copper carbon > }

Note the leading whitespace in < iron copper ... > rather than <iron copper ...>. The latter means a subrule called iron with the arguments copper etc. The former means a "quotewords" list literal just as it does in the main language (though the leading whitespace is optional in the main language).[1]
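As a quick check of the quotewords form, here is a minimal sketch run against a few in-memory lines. The regex name metal and the sample lines are made up for illustration; they are not from the original data.

```raku
# Quote-words alternation, compiled together with the rest of the regex.
my regex metal { || < iron copper carbon > }

# Hypothetical sample lines standing in for the real data file.
my @sample = "beginSection iron", "beginSection zinc", "beginSection carbon";

# Keep only the begin lines whose word is one of the three metals.
my @hits = @sample.grep: { m/ beginSection \s+ <metal> >> / };
.say for @hits;
```

Only the iron and carbon lines should be printed; zinc is not in the list.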

The list of matchers can be put in an array variable:

my @matchers = < iron copper carbon >;
my regex a { || @matchers }

The matchers in @matchers can be arbitrary regexes not just strings:

my @matchers = / i..n /, / cop+er /, / carbon /;
my regex a { || @matchers }
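The array form can be exercised in the same way. The following is a sketch only: the regex name metal and the test strings are illustrative, and the anchors ^ and $ are added here just to make the checks strict.

```raku
# Matchers held in an array; each element is itself a regex.
my @matchers = / i..n /, / cop+er /, / carbon /;
my regex metal { || @matchers }

# so() collapses the match result to True or False.
say so "iron"   ~~ / ^ <metal> $ /;   # i..n matches "iron"
say so "copper" ~~ / ^ <metal> $ /;   # cop+er matches "copper"
say so "zinc"   ~~ / ^ <metal> $ /;   # no matcher applies
```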

Warning: The above works, but while writing this answer I encountered, and have now golfed, an issue: array interpolation with the @ sigil doesn't backtrack.

how to make subrule as fast as explicit regex

It's not about it being explicit. It's about regex interpolation that involves run-time evaluation.

In general, P6 regexes are written in their own regex language[1] that is compiled at compile-time by default.

But the P6 regex language includes the ability to inject code that is then evaluated at run-time (provided it's not dangerous).[2]

This can be useful but incurs run-time overhead which can sometimes be significant.

(It's also possible you've got some bad Big O algorithmic performance going on related to your use of run-time evaluation. If so, that is worse than the overhead of run-time interpolation alone. I've not bothered to analyze that, because it's best just to use fully compiled regexes as per my code above.)

I have also tried using:

my $a=rx/ [ <{ < iron copper carbon > .join("||") }> ] /

That still doesn't avoid run-time interpolation. This construct:

<{ ...  }>

interpolates by evaluating the code inside the braces at run-time and then injecting that into the outer regex.
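To see the difference on your own machine, a rough timing sketch along these lines may help. The regex names runtime-alt and compiled-alt, the sample lines and the repeat count are all assumptions made for illustration, and the timings will vary with the Rakudo version and hardware.

```raku
# The same three words, with the alternation built in two ways.
my regex runtime-alt  { <{ <iron copper carbon>.join("||") }> }  # code evaluated at run time
my regex compiled-alt { || < iron copper carbon > }             # compiled with the regex

# Hypothetical in-memory sample; the size is arbitrary.
my @lines = flat ("beginSection iron", "  filler text", "doneSection") xx 200;

my $t0 = now;
my $slow = @lines.grep(/ <runtime-alt> /).elems;
my $t1 = now;
my $fast = @lines.grep(/ <compiled-alt> /).elems;
my $t2 = now;

say "run-time interpolation: $slow matches in { $t1 - $t0 } s";
say "compiled alternation:   $fast matches in { $t2 - $t1 } s";
```

Both counts should agree; only the elapsed times are expected to differ.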

Footnotes

[1] The P6 "language" is actually an interwoven collection of DSLs.

[2] Unless you explicitly write a use MONKEY-SEE-NO-EVAL; (or just use MONKEY;) pragma to take responsibility for injection attacks, the interpolation of a regex containing injected strings is limited at compile-time to ensure injection attacks aren't possible, and P6 will refuse to run the code if they are. The code you've written isn't subject to attacks, so the compiler let you write it as you have done and compiled it without fuss.

raiph answered Sep 20 '22 09:09