Can someone clarify when whitespace is significant in rules in Perl 6 grammars? I am learning some of it by trial and error, but can't seem to find the actual rules in the documentation.
Example 1:
rule number {
<pm> \d '.'? \d*[ <pm> \d* ]?
}
rule pm {
[ '+' || '-' ]?
}
Will match a number 2.68156e+154, and not care about the spaces that are present in rule number. However, if I add a space after \d*, it will fail (i.e. <pm> \d '.'? \d* [ <pm> \d* ]? fails).
Example 2:
If I am trying to find literals in the middle of a word, then the spacing around them is important. For example, in finding the entry Double_t Delta_phi_R_1_9_pTproj_13_dat_cent_fx3001[52] = {
grammar TOP {
^ .*? <word-to-find> .* ?
}
rule word-to-find {
\w*?fx\w*
}
Will find the word. However, if the definition of the rule word-to-find is changed to fx, or \w* fx\w*, or \w*fx \w*, then it won't make a match.
Also, the definition '[52]' will match, while the definition 'fx[52]' will not.
Thanks for any insight. A pointer to the proper point in the documentation would help greatly! Thanks,
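In a rule, whitespace is turned into a <.ws> (that is, a non-capturing call to the ws token) except:
- at the start of the rule
- at the start of a [ (group) or ( (positional capture)
- after the alternation and conjunction operators ||, |, and &
- after a declaration (such as :my $x = 'foo';)
- around the % operator for introducing a separator
- around the ~ goal-matching operator
- after an adverb (such as :i)
- around the = in a named capture such as $<var> = x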
Or, probably easier to remember, it will be inserted after any construct that could match some characters and after any zero-width assertion.
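As a minimal sketch of that summary, the grammar below compares a rule with a token that spells out the <.ws> calls by hand, and with a token that has none. The grammar and rule names (Demo, spaced, explicit, tight) are made up for illustration, and the literals are quoted only to keep the example unambiguous.

```raku
grammar Demo {
    # Whitespace after each atom becomes <.ws>; the leading whitespace does not.
    rule  spaced   { 'foo' 'bar' }
    # What the rule above is assumed to expand to, written as a token.
    token explicit { 'foo' <.ws> 'bar' <.ws> }
    # In a token the whitespace in the pattern is ignored entirely.
    token tight    { 'foo' 'bar' }
}

say so Demo.parse('foo   bar', :rule<spaced>);    # True
say so Demo.parse('foo   bar', :rule<explicit>);  # True
say so Demo.parse('foo   bar', :rule<tight>);     # False
say so Demo.parse('foobar',    :rule<tight>);     # True
# The default ws starts with <!ww>, so it fails between two word characters.
say so Demo.parse('foobar',    :rule<spaced>);    # False
```

The last line is the case that tends to surprise: inside a rule, a space between two atoms that both match word characters demands real whitespace in the input.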
An important design goal in these rules is to never insert <.ws> somewhere that impedes Longest Token Matching. For example, consider rule foo:sym<ba> { [ bar | baz ] }, which is equivalent to token foo:sym<ba> { [ bar <.ws> | baz <.ws> ] <.ws> }. The default ws implementation is non-declarative (thanks to its use of <!ww>), meaning that it would break longest token matching both at the protoregex level were it inserted at the start of the rule, or at the alternation level were it inserted at the start of the group or after |.
Note that these rules only apply to rule, not to token and regex. They can be switched on at any point using :s and switched off using :!s in any of those, however (rule really just means "pretend there's a :s at the start").
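A small sketch of switching sigspace on and off, assuming the behaviour described above. The grammar and rule names (Toggle, sig-on, sig-off, mixed) are hypothetical and exist only for this example.

```raku
grammar Toggle {
    # :s inside a token makes it behave like a rule from that point on.
    token sig-on  { :s 'a' 'b' }
    # :!s inside a rule turns significant whitespace back off.
    rule  sig-off { :!s 'a' 'b' }
    # The adverb only affects what follows it: 'a' 'b' is tight, 'c' 'd' is not.
    token mixed   { 'a' 'b' :s 'c' 'd' }
}

say so Toggle.parse('a b',   :rule<sig-on>);   # True
say so Toggle.parse('ab',    :rule<sig-on>);   # False
say so Toggle.parse('ab',    :rule<sig-off>);  # True
say so Toggle.parse('a b',   :rule<sig-off>);  # False
say so Toggle.parse('abc d', :rule<mixed>);    # True
say so Toggle.parse('abcd',  :rule<mixed>);    # False
```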
Finally, the ws rule (which defaults to token ws { <!ww> \s* }) can be overridden in a grammar to define what whitespace means in the language being parsed.
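For instance, a grammar might want #-style comments to count as whitespace. The sketch below is one possible way to do that; the grammar names (Plain, Commented) and the comment syntax are assumptions chosen for illustration, not something taken from the question.

```raku
grammar Plain {
    # Equivalent to: 'a' <.ws> 'b' <.ws>, using the default ws.
    rule TOP { 'a' 'b' }
}

grammar Commented is Plain {
    # Same idea as the default ws, but a '#' comment running to the
    # end of the line is also treated as whitespace.
    token ws { <!ww> [ \s | '#' \N* ]* }
}

say so Plain.parse("a b");               # True
say so Plain.parse("a # note\nb");       # False
say so Commented.parse("a # note\nb");   # True
```

Because TOP calls <.ws> as a method, the subclass picks up the new definition without the rule itself being rewritten.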
can someone clarify when white space is significant in rules in Perl 6 grammars?
When :sigspace is in effect.
I'll provide a little more detail below. If you or anyone else reading this needs further details, let me know via comments and I'll expand further.
First, let's eliminate one possible source of confusion, namely the meaning of the words rule and regex in the context of Perl 6, before I provide the doc link.
The word rule may be used in either a generic sense ("the regular expression, string matching and general-purpose parsing facility of Perl 6") or as a keyword (rule). Similarly, regex may be used to mean much the same thing as the generic rule or as a keyword (regex).
With that preamble out of the way, here's a link to the :sigspace doc section.
Note that the rule keyword implicitly inserts a :sigspace such that it takes effect immediately following the first atom in the declared rule, and that the effect is lexical. See @smls's answer to another SO question, especially the first two bullet points, for detailed discussion of these two important details.
You may also find my answer to another SO question dealing with whitespace/tokenization helpful.
Hth.