How does perl6 decide which proto
token
to match against first?
The below code works as expected, it matches the string 1234
, and Grammar::Tracer
shows that the first token matched against is s:sym<d>
, which makes sense since it's the longest token.
However, if I changed a literal to token, for example, changing token
three
form '3'
to <digit>
, it fails to match, and Grammar::Tracer
shows that s:sym<b>
is being matched against first.
Moving s:sym<d>
to the top, matches the string in both cases, but what is the explanation for that behavior?
#!/usr/bin/env perl6
no precompilation;
use Grammar::Tracer;
grammar G {
token TOP { <s> }
proto token s { * }
token s:sym<a> { <one> }
token s:sym<b> { <one> <two> }
token s:sym<c> { <one> <two> <three> }
token s:sym<d> { <one> <two> <three> <four> }
token one { '1' }
token two { '2' }
token three { '3' }
token four { '4' }
}
my $g = G.new;
say $g.parse: '1234';
# Output: Match
# token three { '3' }
TOP
| s
| | s:sym<d>
| | | one
# Output No Match
# token three { <digit> }
TOP
| s
| | s:sym<b>
| | | one
How does perl6 decide which proto token to match against first?
It uses "Longest alternation" logic. In your (well presented!) case the relevant deciding factors are as follows.
First, select the branch which has the longest declarative prefix.
So the first thing to focus on is that it's not "longest token" but longest declarative prefix, the start of a pattern that contains nothing but contiguous "declarative" "atoms".
A 3
is a declarative atom.
A <foo>
may or may not be; it depends on what it includes.
I haven't found clear official documentation identifying which built in patterns are declarative and which are not but it looks like all the ones declared with a slash, eg \d
, are declarative whereas all the ones declared in the form form <foo>
, eg <digit>
, are not.
(Note in particular that the built in <ws>
pattern is not declarative. Given that spaces after atoms in rules
are converted to <ws>
, this means the first such space terminates the declarative prefix of that rule.)
So a <digit>
atom isn't part of a declarative prefix and instead terminates a prefix.
Moving
s:sym<d>
to the top, matches the string in both cases, but what is the explanation for that behavior?
Because with the change of <three>
to call <digit>
you've changed your rules to have three that tie for longest declarative prefix (<one> <two>
). So other tie-breaking rules are used.
If all else fails in those tie-breaking rules to pick a winner, then the last "left-most" rule is selected, which, ignoring inheritance, means the rule that comes first lexically.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With