perl6 Regex subrules and named regex MUCH MUCH slower than explicit regex; how to make them equally fast?

I have a data file with 1608240 lines. The file is divided into sections. Each section has a unique word in its first line, and every section ends with a line containing the same word, "doneSection".

I am trying to fish out some sections by doing the following (code reformatted by @raiph from original post, to make code easier to interpret):

# using named subrules/regex is EXTREMELY slow;
# it reads about 2 lines per second, and grinds to a halt
# after about 500 lines: (>> is the right word boundary)
perl6 -e 'my regex a { [ <{<iron copper carbon>.join("||")}> ] };
          my $x = 0;
          for "/tmp/DataRaw".IO.lines {
            $*ERR.print( "$x 1608240 \r" );
            ++$x;
            .say if m/:i beginSection \s+ <a> >>/ or
                    (m/:i \s+ <a> \s+ /
                     ff
                     m/:i doneSection/);
          }'

# however, if I explicitly write out the regex instead of using a subrule,
# it reads about 1000 lines per second, and it gets the job done:
perl6 -e 'my $x = 0;
          for "/tmp/DataRaw".IO.lines {
            $*ERR.print( "$x 1608240 \r" );
            ++$x;
            .say if m/:i beginSection \s+
                         [ iron || copper || carbon ] >>/ or
                    (m/:i \s+
                         [ iron || copper || carbon ] \s+ /
                     ff
                     m/:i doneSection/);
          }'

My question is: how can I make the subrule as fast as the explicit regex, or at least keep it from grinding to a halt? I prefer using a higher level of abstraction. Is this a regex engine memory problem? I have also tried using:

my $a=rx/ [ <{ < iron copper carbon > .join("||") }> ] /

and it is equally slow.

I cannot post the 1.6 million lines of my data file, but you can probably generate a similar file for testing purposes.

Thanks for any hints.

373

asked Feb 09 '19 09:02

lisprogtor

1 Answer

The problem isn't the use of subrules or named regexes. It's what's inside the regex. The slow version contains:

[ <{<iron copper carbon>.join("||")}> ]

whereas the fast version contains:

[ iron || copper || carbon ]

The following should eliminate the speed difference. Please try it and comment on your results:

my regex a { || < iron copper carbon > }

Note the leading whitespace in < iron copper ... > rather than <iron copper ...>. The latter means a call to a subrule named iron with the arguments copper etc. The former is a "quotewords" list literal, just as in the main language (though in the main language the leading whitespace is optional).¹
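To see how this slots into the loop from the question, here is a minimal sketch. It assumes a handful of invented sample lines in place of /tmp/DataRaw, and it collects matches in a hypothetical @kept array (not part of the original code) so the result is easy to inspect:

```raku
# Sketch only: the sample lines below are made up for illustration.
my regex a { || < iron copper carbon > }   # compiled once, no run-time evaluation

my @lines =
    "beginSection zinc", " zinc 1", "doneSection",
    "beginSection iron", " iron 2 ", "more data", "doneSection",
    "beginSection gold", "doneSection";

my @kept;   # hypothetical collector, used here instead of printing directly
for @lines {
    # Same condition as in the question, with <a> now a fully compiled regex
    @kept.push($_) if m/:i beginSection \s+ <a> >>/ or
                      (m/:i \s+ <a> \s+ / ff m/:i doneSection/);
}
.say for @kept;
```

With real data you would iterate over "/tmp/DataRaw".IO.lines instead of @lines and call .say directly, as in the original one-liner.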

The list of matchers can be put in an array variable:

my @matchers = < iron copper carbon >;
my regex a { || @matchers }

The matchers in @matchers can be arbitrary regexes not just strings:

my @matchers = / i..n /, / cop+er /, / carbon /;
my regex a { || @matchers }

Warning: The above works, but while writing this answer I encountered (and have since golfed) an issue: array interpolation with the @ sigil doesn't backtrack.

"how to make subrule as fast as explicit regex"

It's not about it being explicit. It's about regex interpolation that involves run-time evaluation.

In general, P6 regexes are written in their own regex language¹ that is compiled at compile-time by default.

But the P6 regex language includes the ability to inject code that is then evaluated at run-time (provided it's not dangerous).²

This can be useful but incurs run-time overhead which can sometimes be significant.

(It's also possible that your use of run-time evaluation triggers bad Big O algorithmic performance, which would make things even worse than the overhead of run-time interpolation alone. I haven't analyzed that, because it's best just to use fully compiled regexes as per my code above.)
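If you want to measure the overhead yourself, a rough timing sketch along these lines may help. The regex names slow and fast, the generated sample lines, and the line count of 200 are all assumptions chosen for illustration; both regexes are named, so any difference comes from the run-time evaluation rather than from naming:

```raku
# Rough timing sketch; sample data and sizes are invented for illustration.
my regex slow { [ <{ <iron copper carbon>.join("||") }> ] }  # evaluated at run time
my regex fast { [ iron || copper || carbon ] }               # compiled at compile time

# 200 synthetic lines; every fourth one names a section we do not want
my @lines = (^200).map: { "beginSection " ~ <iron copper carbon zinc>[$_ % 4] };

my $t0 = now;
my $slow-hits = @lines.grep({ m/:i beginSection \s+ <slow> >>/ }).elems;
my $t1 = now;
my $fast-hits = @lines.grep({ m/:i beginSection \s+ <fast> >>/ }).elems;
my $t2 = now;

say "run-time interpolation: $slow-hits hits in { $t1 - $t0 } s";
say "compiled alternation:   $fast-hits hits in { $t2 - $t1 } s";
```

Both versions should report the same number of hits; only the elapsed times are expected to differ, and by how much will depend on your Rakudo version and machine.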

"I have also tried using:"

my $a=rx/ [ <{ < iron copper carbon > .join("||") }> ] /

That still doesn't avoid run-time interpolation. This construct:

<{ ...  }>

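interpolates by evaluating the code inside the braces at run-time and then injecting the result into the outer regex as regex source. Assigning the regex to a variable doesn't change when that evaluation happens.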
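One way to observe this is to count how often the embedded code runs. The following is a small sketch, with a hypothetical $evaluations counter and regex name metal that are not part of the original code:

```raku
# Sketch: count how many times the code inside <{ ... }> is executed.
my $evaluations = 0;   # hypothetical counter, for illustration only
my regex metal { <{ $evaluations++; "iron||copper||carbon" }> }

for <iron copper carbon> -> $word {
    say "$word matched" if $word ~~ / ^ <metal> $ /;
}
say "code block ran $evaluations times";
```

The counter grows with every match attempt that reaches the construct, which is why the cost adds up over a 1.6-million-line file.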

Footnotes

¹ The P6 "language" is actually an interwoven collection of DSLs.

² Unless you explicitly write a use MONKEY-SEE-NO-EVAL; (or just use MONKEY;) pragma to take responsibility for injection attacks, interpolation of a regex containing injected strings is restricted at compile-time to ensure injection attacks aren't possible, and P6 will refuse to run code that violates the restriction. The code you've written isn't subject to such attacks, so the compiler accepted it without fuss.

151

answered Sep 20 '22 09:09

raiph

Tags: performance, regex, raku
