I am still learning Perl 6, and I am reading the grammar example on this page: http://examples.perl6.org/categories/parsers/SimpleStrings.html. I have read the regex documentation multiple times, but there is still some syntax that I don't understand.
token string { <quote> {} <quotebody($<quote>)> $<quote> }
Question 1: what is this "{}" in the token doing? The capture marker is <()>, and nesting structures use the tilde, as in '(' ~ ')'; but what is {}?
token quotebody($quote) { ( <escaped($quote)> | <!before $quote> . )* }
Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument and returns another regex ?
Question 2b: If I want to indicate "char that is not before quote", should I use ". <!before $quote>" instead of "<!before $quote> ." ??
token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;
TL;DR @briandfoy has provided an easy to digest answer. But here be dragons that he didn't mention. And pretty butterflies too. This answer goes deep.
Question 1: what is this {} in the token doing?
It's a code block [1, 2, 3, 4].
It's an empty one and has been inserted purely to force the $<quote> in quotebody($<quote>) to evaluate to the value captured by the <quote> at the start of the regex.
The reason why $<quote> does not contain the right value without insertion of a code block is a Rakudo Perl 6 compiler limitation or bug related to "publication of match variables".
Moritz Lenz states in a Rakudo bug report that "the regex engine doesn't publish match variables unless it is deemed necessary".
By "regex engine" he means the regex/grammar engine in NQP, part of the Rakudo Perl 6 compiler [3].
By "match variables", he means the variables that store captures of match results: the current match variable $/; the numbered sub-match variables $0, $1, etc.; and named sub-match variables of the form $<foo>.
By "publish" he means that the regex/grammar engine does what it takes so that any mentions of any variables in a regex (a token is also a regex) evaluate to the values they're supposed to have when they're supposed to have them. Within a given regex, match variables are supposed to contain a Match object corresponding to what has been captured for them at any given stage in processing of that regex, or Nil if nothing has been captured.
By "deemed necessary" he means that the regex/grammar engine makes a conservative call about whether it's worth doing the publication work after each step in the matching process. By "conservative" I mean that the engine often avoids doing publication, because it slows things down and is usually unnecessary. Unfortunately it's sometimes too optimistic in deciding that publication can be skipped. Hence the need for programmers to sometimes intervene by explicitly inserting a code block to force publication of match variables (and other techniques for other variables [5]). It's possible that the regex/grammar engine will improve in this regard over time, reducing the scenarios in which manual intervention is necessary. If you wish to help progress this, please create test cases that matter to you for existing related bugs [5].
$<quote>'s value
The named capture $<quote> is the case in point here.
As far as I can tell, all sub-match variables correctly refer to their captured value when written directly into the regex without a surrounding construct. This works:
my regex quote { <['"]> }
say so '"aa"' ~~ / <quote> aa $<quote> /; # True
I think [6] $<quote> gets the right value because it is parsed as a regex slang construct [4].
In contrast, if the {} were removed from
token string { <quote> {} <quotebody($<quote>)> $<quote> }
then the $<quote> in quotebody($<quote>) would not contain the value captured by the opening <quote>.
I think this is because the $<quote> in this case is parsed as a main slang construct.
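A minimal sketch of the pattern described above, assuming a reasonably recent Rakudo. The grammar name Demo and the token name body are hypothetical stand-ins for the names in the linked example. The empty code block sits between the capture and the call that receives the capture as an argument.

```raku
# Hypothetical, cut-down version of the grammar in the question.
grammar Demo {
    # The empty {} forces $<quote> to be published before it is
    # passed as an argument to <body(...)>.
    token TOP   { <quote> {} <body($<quote>)> $<quote> }
    token quote { <['"]> }
    # Consume characters for as long as the opening quote is not next.
    token body($q) { [ <!before $q> . ]* }
}

say so Demo.parse(q{"abc"});   # matching double quotes
say so Demo.parse(q{'abc'});   # matching single quotes
say so Demo.parse(q{"abc'});   # mismatched quotes
```

The first two parses should succeed and the third should fail, because the closing $<quote> has to match the same character that the opening <quote> captured.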
Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument
That's a good first approximation.
More specifically, regex atoms of the form <foo(...)> are calls of the method foo.
All regexes -- whether declared with token, regex, rule, /.../ or any other form -- are methods. But methods declared with method are not regexes:
say Method ~~ Regex; # False
say WHAT token { . } # (Regex)
say Regex ~~ Method; # True
say / . / ~~ Method; # True
When the <escaped($quote)> regex atom is encountered, the regex/grammar engine doesn't know or care if escaped is a regex or not, nor about the details of method dispatch inside a regex or grammar. It just invokes method dispatch, with the invocant set to the Match object that's being constructed by the enclosing regex.
The call yields control to whatever ends up running the method. It typically turns out that the regex/grammar engine is just recursively calling back into itself because typically it's a matter of one regex calling another. But it isn't necessarily so.
and returns another regex
No, a regex atom of the form <escaped($quote)> does not return another regex.
Instead it calls a method that will/should return a Match object.
If the method called was a regex, P6 will make sure the regex generates and populates the Match object automatically.
If the method called was not a regex but instead just an ordinary method, then the method's code should have manually created and returned a Match object. Moritz shows an example in his answer to the SO question "Can I change the Perl 6 slang inside a method?".
The Match object is returned to the "regex/grammar engine" that drives regex matching / grammar parsing [3].
The engine then decides what to do next according to the result:
If the match was successful, the engine updates the overall match object corresponding to the calling regex. The updating may include saving the returned Match object as a sub-match capture of the calling regex. This is how a match/parse tree gets built.
If the match was unsuccessful, the engine may backtrack, undoing previous updates; thus the parse tree may dynamically grow and shrink as matching progresses.
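As a small illustration of sub-match captures ending up in the parse tree, here is a sketch using a hypothetical grammar named Words (not part of the original example). Each successful <word> call returns a Match object that the engine stores under the name word in the enclosing match.

```raku
# Hypothetical grammar, used only to show sub-match captures.
grammar Words {
    token TOP  { <word> \s+ <word> }
    token word { \w+ }
}

my $match = Words.parse('hello world');

# <word> was called twice, so $match<word> holds two Match objects.
say $match<word>.elems;        # 2
say $match<word>.map(*.Str);   # (hello world)
```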
Question 2b: If I want to indicate "char that is not before quote", should I use . <!before $quote> instead of <!before $quote> . ??
Yes.
But that's not what's needed for the quotebody regex, if that's what you're talking about.
While on the latter topic, in @briandfoy's answer he suggests using a "Match ... anything that's not a quote" construct rather than doing a negative lookahead (<!before $quote>). His point is that matching "not a quote" is much easier to understand than "are we not before a quote? then match any character".
However, it is by no means straightforward to do this when the quote is a variable whose value is set to the capture of the opening quote. This complexity is due to bugs in Rakudo. I've worked out what I think is the simplest way around them, but I think it's likely best to just stick with <!before $quote> . unless/until these long-standing Rakudo bugs are fixed [5].
token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;
It's a token, which is a Regex, which is a Method, which is a Routine:
say token { . } ~~ Regex; # True
say Regex ~~ Method; # True
say Method ~~ Routine; # True
The code inside the body (the { ... } bit) of a regex (in this instance the code is the lone . in token { . }, which is a regex atom that matches a single character) is written in the P6 regex "slang" whereas the code used inside the body of a method routine is written in the main P6 "slang" [4].
~
The regex tilde (~) operator is specifically designed for the sort of parsing in the example this question is about. It reads better inasmuch as it's instantly recognizable and keeps the opening and closing quotes together. Much more importantly it can provide a human intelligible error message in the event of failure because it can say what closing delimiter(s) it's looking for.
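To make the shape of the operator concrete, here is a minimal sketch with two hypothetical lexical regexes, plain and tilde. Both accept a parenthesized number; the second writes the opening and closing delimiters next to each other and puts the content after them. Only the successful case is shown here.

```raku
# Hypothetical lexical regexes; both match a parenthesized number.
my regex plain { '(' \d+ ')' }    # delimiters written around the content
my regex tilde { '(' ~ ')' \d+ }  # delimiters first, content last

say so '(42)' ~~ / ^ <plain> $ /;   # True
say so '(42)' ~~ / ^ <tilde> $ /;   # True
```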
But there's a key wrinkle you must consider if you insert a code block in a regex (with or without code in it) right next to the regex ~ operator (on either side of it). You will need to group the code block unless you specifically want the tilde to treat the code block as its own atom. For example:
token foo { <quote> ~ $<quote> {} <quotebody($<quote>)> }
will match a pair of <quote>s with nothing between them. (And then try to match <quotebody...>.)
In contrast, here's a way to duplicate the matching behavior of the string token in the String::Simple::Grammar grammar:
token string { <quote> ~ $<quote> [ {} <quotebody($<quote>)> ] }
1 In 2002 Larry Wall wrote "It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.". Computer scientists note that you can't have procedural code in the middle of a traditional regular expression. But Perls long ago led the shift to non-traditional regexes and P6 has arrived at the logical conclusion -- a simple {...} is all it takes to insert arbitrary procedural code in the middle of a regex. The language design and regex/grammar engine implementation [3] ensure that traditional style purely declarative regions within a regex are recognized, so that formal regular expression theory and optimizations can be applied to them, but nevertheless arbitrary regular procedural code can also be inserted. Simple uses include matching logic and debugging. But the sky's the limit.
2 The first procedural element of a regex, if any, terminates what's called the "declarative prefix" of the regex. A common reason for inserting an empty code block ({}) is to deliberately terminate a regex's declarative prefix when that provides the desired matching semantics for a given longest alternation in a regex. (But that isn't the reason for its inclusion in the token you're trying to understand.)
3 Loosely speaking, the regex / grammar engine in NQP is to P6 what PCRE is to P5.
A key difference is that the regex language, along with its associated regex/grammar engine, and the main language it cooperates with, which in the case of Rakudo is Perl 6, are co-equals control-wise. This is an implementation of Larry Wall's original 2002 vision for integration between regexes and "rich languages". Each language/run-time can call into the other and communicate via high level FFIs. So they can appear to be, can behave as, and indeed are, a single system of cooperating languages and cooperating run-times.
(The P6 design is such that all languages can be explicitly designed, or be retro-fitted, to cooperate in a "rich" manner via two complementary P6 FFIs: the metamodel FFI 6model and/or the C calling convention FFI NativeCall.)
4 The P6 language is actually a collection of sub-languages -- aka slangs -- that are used together. When you are reading or writing P6 code you are reading or writing source code that starts out in one slang but has sections written in others. The first line in a file uses the main slang. Let's say that's analogous to English. Regexes are written in another slang; let's say that's like Spanish. So in the case of the grammar String::Simple::Grammar, the code begins in English (the use v6; statement), then recurses into Spanish (after the { of rule TOP {), i.e. the ^ <string> $ bit, and then returns back out into English (the comment starting # Note ...). Then it recurses back into Spanish for <quote> {} <quotebody($<quote>)> $<quote> and in the middle of that Spanish, at the {} code block, it recurses into another level of English again. So that's English within Spanish within English. Of course, the code block is empty, so it's like writing/reading nothing in English and then immediately dropping back into Spanish, but it's important to understand that this recursive stacking of languages/run-times is how P6 works, both as a single overall language/run-time and when cooperating with other non-P6 languages/run-times.
5 I encountered several bugs, listed at the end of this footnote, in the process of applying two potential improvements. (Both are mentioned in briandfoy's answer and this one.) The two "improvements" are use of the ~ construct, and a "not a quote" construct instead of the <!before foo> . construct. The final result, plus mention of pertinent bugs:
grammar String::Simple::Grammar {
    rule TOP {^ <string> $}
    token string {
        :my $*not-quote;
        <quote> ~ $<quote>
        [
            { $*not-quote = "<-[$<quote>]>" }
            <quotebody($<quote>)>
        ]
    }
    token quote { '"' | "'" }
    token quotebody($quote) { ( <escaped($quote)> | <$*not-quote> )* }
    token escaped($quote) { '\\' ( $quote | '\\' ) }
}
If anyone knows of a simpler way to do this, I'd love to hear about it in a comment below.
I ended up searching the RT bugs database for all regex bugs. I know SO isn't a bug database but I think it's reasonable for me to note the following ones. As I understand it, the first two directly interact with the issue of publication of match variables.
"the < > regex call syntax looks up lexicals only in the parent scope of the regex it is used in, and not in the scope of the regex itself." rt #127872
Backtracking woes as they relate to passing arguments in regex calls
It looks like there are lots of nasty threading bugs. Most boil down to the fact that several regex features use EVAL behind the scenes and EVAL is not yet thread-safe. Fortunately the official doc mentions these.
Can't do recursive grammars due to .parse setting $/.
6 This question and my answer have pushed me to the outer limits of my understanding of an ambitious and complex aspect of P6. I plan to soon gain greater insight into the precise interactions between nqp and full P6, and the hand-offs between their regex slangs and main slangs, as discussed in footnotes above. (My hopes currently largely rest on having just bought commaide.) I'll update this answer if/when I have some results.