
When is white space really important in Perl6 grammars?

Can someone clarify when whitespace is significant in rules in Perl 6 grammars? I am learning by trial and error, but I can't seem to find the actual rules in the documentation.

Example 1:

rule number {
    <pm> \d '.'? \d*[ <pm> \d* ]?
}

rule pm {
    [ '+' || '-' ]?
}

This will match the number 2.68156e+154 and does not care about the spaces that are present in rule number. However, if I add a space after \d*, it fails (i.e. <pm> \d '.'? \d* [ <pm> \d* ]? fails).

Example 2: If I am trying to find literals in the middle of a word, then the spacing around them is important. For example, in finding the entry Double_t Delta_phi_R_1_9_pTproj_13_dat_cent_fx3001[52] = {

grammar TOP {
    ^ .*? <word-to-find> .* ?
}
rule word-to-find {
    \w*?fx\w*
}

This will find the word. However, if the definition of the rule word-to-find is changed to fx, or to \w* fx\w*, or to \w*fx \w*, then it won't match.

Also, the definition '[52]' will match, while the definition 'fx[52]' will not.

Thanks for any insight. A pointer to the relevant part of the documentation would help greatly!

dave asked Feb 20 '18 18:02




2 Answers

In a rule, whitespace is turned into a <.ws> (that is, a non-capturing call to the ws token) except:

  • At the start of the rule, before the first atom
  • At the start of a [ (group) or ( (positional capture)
  • After ||, |, and &
  • After a variable declaration (:my $x = 'foo';)
  • After a code block
  • After the % operator for introducing a separator
  • After the ~ goal-matching operator
  • After an internal modifier (such as :i)
  • Inside of a construct like $<var> = x

Or, probably easier to remember, it will be inserted after any construct that could match some characters and after any zero-width assertion.
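As a minimal sketch of the basic transformation (the names unspaced and spaced are made up for illustration), the same pattern can be declared once as a token and once as a rule and tried against a few inputs:

```raku
# Hypothetical names, for illustration only.
my token unspaced { a b }   # whitespace in a token is ignored, so this matches 'ab'
my rule  spaced   { a b }   # behaves like: token { a <.ws> b <.ws> }

for 'ab', 'a b', 'a   b' -> $input {
    say "'$input' token: ", so($input ~~ /^ <unspaced> $/),
        ' rule: ',          so($input ~~ /^ <spaced> $/);
}
```

The token should accept only 'ab', while the rule should accept 'a b' and 'a   b' but reject 'ab', because the default ws cannot match between two word characters.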

An important design goal in these rules is to never insert <.ws> somewhere that impedes Longest Token Matching. For example, consider rule foo:sym<ba> { [ bar | baz ] }, which is equivalent to token foo:sym<ba> { [ bar <.ws> | baz <.ws> ] <.ws> }. The default ws implementation is non-declarative (thanks to its use of <!ww>), meaning that it would break longest token matching both at the protoregex level were it inserted at the start of the rule, or at the alternation level were it inserted at the start of the group or after |.
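The equivalence described above can be checked with a small sketch. It uses plain lexical declarations rather than a protoregex, and the names implicit and explicit are assumptions made for this example:

```raku
# Hypothetical names; 'explicit' spells out where <.ws> ends up in 'implicit'.
my rule  implicit { [ bar | baz ] }
my token explicit { [ bar <.ws> | baz <.ws> ] <.ws> }

for 'bar', 'baz ', ' bar', 'barx' -> $input {
    # Both columns should agree for every input.
    say "'$input' rule: ", so($input ~~ /^ <implicit> $/),
        ' token: ',        so($input ~~ /^ <explicit> $/);
}
```

Neither form accepts ' bar', since no <.ws> is inserted before the first atom, and neither accepts 'barx', since <.ws> fails between the word characters r and x.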

Note that these rules only apply to rule, not to token and regex. They can be switched on at any point using :s and switched off using :!s in any of those, however (rule really just means "pretend there's a :s at the start").
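A short sketch of switching the behaviour on and off, assuming the made-up names spaced-token and unspaced-rule:

```raku
# Hypothetical names, for illustration only.
my token spaced-token  { :s a b }    # :s turns significant whitespace on inside a token
my rule  unspaced-rule { :!s a b }   # :!s turns it off again inside a rule

say so 'a b' ~~ /^ <spaced-token> $/;    # True
say so 'ab'  ~~ /^ <spaced-token> $/;    # False
say so 'ab'  ~~ /^ <unspaced-rule> $/;   # True
say so 'a b' ~~ /^ <unspaced-rule> $/;   # False
```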

Finally, the ws rule (which defaults to token ws { <!ww> \s* }) can be overridden in a grammar to define what whitespace means in the language being parsed.
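As an illustration of overriding ws, here is a minimal sketch with two made-up grammars: DefaultWs keeps the built-in ws, and HorizontalWs redefines it so that only horizontal whitespace counts (an assumption chosen for this example, not something from the question):

```raku
# Hypothetical grammars, for illustration only.
grammar DefaultWs {
    rule  TOP  { [ <word> ]+ }   # behaves like: [ <word> <.ws> ]+ <.ws>
    token word { \w+ }
}

grammar HorizontalWs is DefaultWs {
    # Override: newlines no longer count as whitespace between words.
    token ws { <!ww> \h* }
}

say so DefaultWs.parse("a b\nc");      # True
say so HorizontalWs.parse("a b\nc");   # False
say so HorizontalWs.parse("a b c");    # True
```

Because every whitespace position in the rule calls <.ws>, changing that one token changes what all rules in the grammar accept.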

Jonathan Worthington answered Nov 07 '22 15:11



can someone clarify when white space is significant in rules in Perl 6 grammars?

When :sigspace is in effect.

I'll provide a little more detail below. If you or anyone else reading this needs further details, let me know via comments and I'll expand further.

First, let's eliminate one possible source of confusion, namely the meaning of the words rule and regex in the context of Perl 6, before I provide the doc link.

The word rule may be used in either a generic sense ("the regular expression, string matching and general-purpose parsing facility of Perl 6") or as a keyword (rule). Similarly, regex may be used to mean much the same thing as the generic rule or as a keyword (regex).

With that preamble out of the way, here's a link to the :sigspace doc section.

Note that the rule keyword implicitly inserts a :sigspace such that it takes effect immediately following the first atom in the declared rule, and that the effect is lexical. See @smls's answer to another SO question, especially the first two bullet points, for detailed discussion of these two important details.
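The same adverb can also be applied to an ordinary regex match, which makes the effect easy to see outside a grammar. A small sketch with made-up input strings:

```raku
say so 'photo shop' ~~ m/ photo shop /;     # False: the space in the pattern is ignored
say so 'photo shop' ~~ m:s/ photo shop /;   # True:  the space in the pattern is significant
say so 'photoshop'  ~~ m:s/ photo shop /;   # False: <.ws> cannot match inside a word
```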

You may also find my answer to another SO question dealing with whitespace/tokenization helpful.

Hth.

raiph answered Nov 07 '22 14:11
