
<.ident> function/capture in perl6 grammars

While reading the XML grammar for Perl 6 (https://github.com/supernovus/exemel/blob/master/lib/XML/Grammar.pm6), I am having some difficulty understanding the following token.

token pident {
  <!before \d> [ \d+ <.ident>* || <.ident>+ ]+ % '-'
}

More specifically, I don't understand <.ident>. There are no other definitions of ident, so I am assuming it is a reserved term, though I can't find a proper definition on perl6.org. Does anyone know what this means?

Mikkel asked Jun 04 '18 08:06


2 Answers

TL;DR I'll start with a precise and relatively concise answer. The rest of this answer is for those wanting to know more about built in rules in general and/or to drill down into ident in particular.

<.ident> function/capture

Because of the ., <.ident> only matches, it doesn't capture[1]. For the rest of this answer I'll generally omit the . because it makes no difference to a rule's meaning besides the capture aspect.
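A minimal sketch of the capture difference, assuming a plain regex match outside any grammar (the variable names are only illustrative):

```raku
# <ident> stores its submatch under the key "ident";
# <.ident> matches the same text but stores nothing.
my $with-capture    = 'abc' ~~ / <ident> /;
my $without-capture = 'abc' ~~ / <.ident> /;

say ~$with-capture;                    # abc
say $with-capture<ident>:exists;       # True
say ~$without-capture;                 # abc
say $without-capture<ident>:exists;    # False
```

Both forms match the same text; only the contents of the resulting match object differ.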

Just as you can invoke (aka "call") one function within the declaration of another in programming languages, so too you can invoke a rule/token/regex/method (hereafter I'll generally just use the term "rule") within the declaration of another rule. <foo> is the syntax used to invoke a rule named foo; so <ident> invokes a rule (method) named ident.

At the time I write this, the XML::Grammar grammar does not itself define/declare a rule named ident. That means the call ends up dispatched to a built in declaration with that name.

The built in ident rule does precisely the same as if it were declared as:

token ident {
    [ <alpha> ]
    [ <alnum> ]*
}

The official Predefined character classes doc should provide precise definitions of <alpha> and <alnum>. Alternatively, the relevant details are also included later on in this answer.

The bottom line is that ident matches a string of one or more "alphanumeric" characters except that the first character cannot be a "number".

Thus both abc and def123 match whereas 123abc does not.
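As a quick check of the claims above, here is a small sketch that compares the built in ident with a hand-written token following the equivalent declaration shown earlier. The name my-ident is hypothetical, chosen only so it doesn't shadow the built in:

```raku
# Hand-written equivalent of the built in rule, under a made-up name.
my token my-ident { <alpha> <alnum>* }

for <abc def123 123abc> -> $s {
    # Anchor with ^ and $ so the whole string has to match.
    my $builtin  = ($s ~~ /^ <ident> $/).so;
    my $handmade = ($s ~~ /^ <my-ident> $/).so;
    say "$s builtin=$builtin handmade=$handmade";
}
```

Both columns should agree: abc and def123 match in full, while 123abc does not because it starts with a digit.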

The rest of this answer

For those interested in detail worth knowing I've written the following sections:

  • Raku (standard language and class details)

  • Rakudo (high level implementation)

  • NQP (mid level implementation)

  • MoarVM (low level implementation)

  • The specification and "specification" of ident

  • (Corrections of) documentation of <ident>, "character class" and "identifier"

  • ident vs Raku identifiers

Raku (standard language and class details)

XML::Grammar is a user defined Raku grammar. A Raku grammar is a class. ("Grammars are really just slightly specialized classes".)

A Raku rule is a regex is a method:

grammar foo { rule ident { ... } }

say foo.^lookup('ident').WHAT; # (Regex)
say Regex ~~ Method;           # True

A rule call, like <ident>, in a grammar, is typically invoked as a result of calling .parse or similar on the grammar. The .parse call matches the input string according to the rules in the grammar.

When an occurrence of <ident> within XML::Grammar is evaluated during a match, the result is an ident method (rule) call on an instance of XML::Grammar (the .parse call creates an instance of its invocant if it's just a type object).

Because XML::Grammar does not itself define a rule/method of that name, the ident call is instead dispatched according to standard method resolution, er, rules. (I'm using the word "rules" here in the generic non-Raku specific sense. Ah, language.)

In Raku, any class created using a declaration of the form grammar foo { ... } automatically inherits from the Grammar class which in turn inherits from the Match class:

say .^mro given grammar foo {} # ((foo) (Grammar) (Match) (Capture) (Cool) (Any) (Mu))

ident is found in the built in Match class.
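The dispatch behaviour described in this section can be sketched with two toy grammars (WithBuiltin and WithOverride are hypothetical names, not part of XML::Grammar). One relies on the inherited ident; the other declares its own, which method resolution finds first:

```raku
# No ident declared here, so <ident> resolves to the inherited built in.
grammar WithBuiltin {
    token TOP { <ident> }
}

# Declares its own ident, which takes precedence over the built in.
grammar WithOverride {
    token TOP   { <ident> }
    token ident { \d+ }
}

say WithBuiltin.parse('abc').so;     # True
say WithBuiltin.parse('123').so;     # False
say WithOverride.parse('123').so;    # True
say WithOverride.parse('abc').so;    # False
```

This is only an illustration of the lookup order; overriding ident in real code would rarely be a good idea.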

Rakudo (high level implementation)

In the Rakudo compiler, the Match class does the role NQPMatchRole.

This NQPMatchRole is where the highest level implementation of ident is found.

NQP (mid level implementation)

NQPMatchRole is written in the nqp language, a subset of Raku used to bootstrap the full Raku, and the heart of NQP, a compiler toolkit.

Excerpting and reformatting just the most salient code from the ident declaration, the match for the first character boils down to:

   nqp::ord($target, $!pos) == 95
|| nqp::iscclass(nqp::const::CCLASS_ALPHABETIC, $target, $!pos)

This matches if the first character is either a _ (95 is the ASCII code / Unicode codepoint for an underscore) or a character matching a character class defined in NQP called CCLASS_ALPHABETIC.

The other bit of salient code is:

nqp::findnotcclass( nqp::const::CCLASS_WORD

This matches zero or more subsequent characters in the character class CCLASS_WORD.

A search of NQP for CCLASS_ALPHABETIC shows several matches. The most useful seems to be an NQP test file. While this file makes it clear that CCLASS_WORD is a superset of CCLASS_ALPHABETIC, it doesn't make it clear what those classes actually match.

NQP targets multiple "backends" or concrete virtual machines. Given the relative paucity of Rakudo/NQP doc/tests of what these rules and character classes actually match, one has to look at one of its backends to verify what's what.

MoarVM (low level implementation)

MoarVM is the only officially supported backend.

A search of MoarVM for CCLASS shows several matches.

The important one seems to be ops.c which includes a switch (cclass) statement which in turn includes cases for MVM_CCLASS_ALPHABETIC and MVM_CCLASS_WORD that correspond to NQP's similarly named constants.

According to the code's comments:

CCLASS_ALPHABETIC currently matches exactly the same characters as the full Raku or NQP <:L> rule, i.e. the characters Unicode has classified as "Letters".

I think that means <alpha> is equivalent to the union of CCLASS_ALPHABETIC and _ (underscore).

CCLASS_WORD matches the same plus <:Nd>, i.e. decimal digits (in any human language, not just English).

I think that means the Raku / NQP <alnum> rule is equivalent to CCLASS_WORD.
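To make the reading of those comments concrete, here is a rough sketch that spells the same logic out with Unicode properties and compares it with the built in rule. The token name ident-sketch is made up, and the sketch assumes the interpretation given above (first character is _ or <:L>, later characters are _, <:L> or <:Nd>); it is not taken from the NQP or MoarVM sources:

```raku
# Assumed equivalent of the low level logic, written with Unicode properties.
my token ident-sketch { [ '_' | <:L> ] [ '_' | <:L> | <:Nd> ]* }

# \x[0969] is DEVANAGARI DIGIT THREE, a decimal digit outside ASCII.
for 'ab2', '_x9', "a\x[0969]", '2ab', "\x[0969]a" -> $s {
    my $builtin = ($s ~~ /^ <ident> $/).so;
    my $sketch  = ($s ~~ /^ <ident-sketch> $/).so;
    say "$s builtin=$builtin sketch=$sketch";
}
```

If the interpretation is right, the two columns agree for every sample: the first three strings match and the last two, which start with a digit, do not.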

The specification and "specification" of ident

The official specification of Raku is embodied in roast[2].

A search of roast for ident shows several matches.

Most use <ident> only incidentally, as part of testing something else. The specification requires that they work as shown, but you won't understand what <ident> is supposed to do by looking at incidental usage.

Three tests clearly test <ident> itself. One of those is essentially redundant, leaving two. I see no changes between the 6.c and 6.c.errata versions of these two matches:

From S05-mass/rx.t:

ok ('2+3 ab2' ~~ /<ident>/) && matchcheck($/, q/mob<ident>: <ab2 @ 4>/), 'capturing builtin <ident>';

ok tests that its first argument returns True. This call tests that <ident> skips 2+3 and matches ab2.
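The matchcheck helper belongs to the test file, so here is a standalone sketch of the same check using only the match object (the variable name $m is arbitrary):

```raku
# The regex is unanchored, so it scans past "2+3 " and finds "ab2".
my $m = '2+3 ab2' ~~ /<ident>/;

say ~$m<ident>;        # ab2
say $m<ident>.from;    # 4
```

The 4 is the zero-based offset of the a, which is what `<ab2 @ 4>` in the test expresses.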

From S05-mass/charsets.t:

is $latin-chars.comb(/<ident>/).join(" "), "ABCDEFGHIJKLMNOPQRSTUVWXYZ _ abcdefghijklmnopqrstuvwxyz ª µ º ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö øùúûüýþÿ", 'ident chars';

is tests that its first argument matches its second. This call tests what the ident rule matches from a string consisting of the first 256 Unicode codepoints (the Latin-1 character set).

Here's a variation of this test that more clearly shows the matching that happens:

my $latin-chars = (0..0xFF).map(*.chr).join;  # assumed setup: the first 256 codepoints
say ~$_ for $latin-chars ~~ m:g/<ident>/;

prints:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
_
abcdefghijklmnopqrstuvwxyz
ª
µ
º
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
øùúûüýþÿ

But <ident> will match a whole lot more than just a hundred or so characters from Latin-1. So, while the above tests cover what <ident> is officially specified/tested to match, they clearly don't cover the full picture.

So let's look at the official speculation that may, with care, be considered related to "specification".

First, we note the warning at the top:

Note: these documents may be out of date.
For Perl 6 documentation see docs.perl6.org;
for specs, see the official test suite.

The term "specs" in this warning is short for "specification". As already explained, the official specification test suite is roast, not any human language verbiage.

(Some people still think of these historical design docs as "specifications" too, and refer to them as "specs", but the official view is that "specs", as applied to the design docs, should be considered to be short for "speculations" to emphasize that they are not something to be fully relied upon.)

A search for ident in design.raku.org shows several matches.

The most useful match is in the Predefined Subrules section of S05:

These are some of the predefined subrules for any grammar or regex:

  • ident ... Match an identifier.

Uhoh...

(Corrections of) documentation of <ident>, "character class" and "identifier"

From Predefined character classes in the official doc:

    Class                             Description
    <ident>                           Identifier. Also a default rule.

This is misleading in three ways:

  • ident is not a character class. Character classes match a single character in that character class; if used with a quantifier they just match a string of such characters, each of which can be any character from that class. In contrast <ident> matches a particular pattern of characters. It may be one character but you can't control that; the rule is greedy, matching as many characters as fit the pattern. If you apply a quantifier it controls repetition of the overall rule, not how many characters are included in a single match of the rule.

  • All built in rules are default rules. I think the default comment is there to emphasize that you can write your own ident rule if you don't like the built-in pattern. This is true for all rules though it will typically make much less sense to override built ins such as canonical character classes like <lower> (lowercase).

  • ident does not match identifiers! Or, more accurately, it doesn't do so on its own for most Raku identifiers. See the next section for the details.
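The first point can be seen in a short sketch that applies the same quantifier to a genuine character class and to ident (the sample string and variable names are arbitrary):

```raku
my $one-alpha  = ~('ab2 cd' ~~ / <alpha> /);               # a
my $two-alphas = ~('ab2 cd' ~~ / <alpha> ** 2 /);          # ab
my $one-ident  = ~('ab2 cd' ~~ / <ident> /);               # ab2
my $two-idents = ~('ab2 cd' ~~ / <ident> ** 2 % ' ' /);    # ab2 cd

.say for $one-alpha, $two-alphas, $one-ident, $two-idents;
```

With <alpha>, the quantifier counts characters. With <ident>, it counts whole matches of the rule, here two of them separated by a space.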

ident vs Raku identifiers

my @Identifiers = < $bar %hash Foo Foo::Bar your_ident anothers' my-ident >; 
say (~$/ if m/^<ident>$/ for @Identifiers); # (Foo your_ident)
say (~$/ if m/ <ident> / for @Identifiers); # (bar hash Foo Foo your_ident anothers my)

In nqp's grammar, which is defined in NQP's Grammar.nqp, there's:

token identifier { <.ident> [ <[\-']> <.ident> ]* }

In Raku's grammar, which is defined in Rakudo's Grammar.nqp, there's code that looks slightly different but has the exact same effect:

token apostrophe { <[ ' \- ]> }
token identifier { <.ident> [ <.apostrophe> <.ident> ]* }

So <identifier> matches a pattern that includes one or more <ident>s with <apostrophe>s in between.

The ident method is in NQPMatchRole, which means it's a built-in that's part of the rule namespace of users' grammars.

But the identifier methods are not exported by either Raku or nqp. So they are not part of the rule namespace of users' grammars.

If we write our own identifier token we can see it in action:

my token identifier { <.ident> [ <[\-']> <.ident> ]* }
my token sigil { <[$@%&]> }
say (~$/ if m/^ <sigil>? <identifier> $/ for @Identifiers)

displays:

($bar %hash Foo your_ident my-ident)

To summarize the above and some other considerations:

  • <ident> matches just parts of what <identifier> matches (though they're the same for simple names). Consider is-prime. This is a Raku identifier but contains two <ident> matches (is and prime).

  • <identifier> matches just parts of "Raku identifiers" (though they're the same for simple names). Consider infix:<+>. This is sometimes referred to as a Raku identifier but requires both an <identifier> match and a match of :<+>.

  • Raku identifiers are themselves just parts of names (though they're the same for the simplest names). Consider Foo-Bar::Baz-Qux which contains two <identifier> matches (each in turn containing two <ident> matches).
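The first and third points can be illustrated with .comb, reusing the hand-written identifier token from above (it is our own declaration, not something exported by Raku or nqp):

```raku
# Our own copy of the token; user code has no built in <identifier>.
my token identifier { <.ident> [ <[\-']> <.ident> ]* }

say 'is-prime'.comb(/<ident>/);                   # (is prime)
say 'is-prime'.comb(/<identifier>/);              # (is-prime)
say 'Foo-Bar::Baz-Qux'.comb(/<ident>/);           # (Foo Bar Baz Qux)
say 'Foo-Bar::Baz-Qux'.comb(/<identifier>/);      # (Foo-Bar Baz-Qux)
```

Each <identifier> match spans one or more <ident> matches, and a full name such as Foo-Bar::Baz-Qux spans more than one <identifier> match.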

Footnotes

[1] If you're not sure what a capture is, see Capturing, Named captures and Subrules.

[2] The official specification of Raku is a test suite called roast -- the Repository Of All Specification Tests. The latest version of a specific branch of roast defines a specific version of Raku. When I first wrote this answer there had only been two official branches/versions of roast, and therefore of Raku. The first was 6.c aka 6.Christmas. This was cut on Christmas day 2015 and has been deliberately left frozen since that day. The second was 6.c.errata, which conservatively added corrections to 6.c deemed sufficiently important and backwards compatible to be included in the (then) current official recommended version of Raku. An "officially compliant" Raku compiler passes some official branch of roast. The Rakudo compiler (then) passed 6.c.errata. If you read all the tests involving a feature in, say, the 6.c.errata branch of roast, then you'll have read a full definition of the officially specified meaning of that feature for the 6.c.errata version of the Raku language.

raiph answered Oct 25 '22 23:10


In general, the place to look for the documentation is Perl6 documentation. That's part of a regex, and you can find it in the definition of character classes. It matches Perl6 identifiers. What the . in front of ident does is to suppress capture.

jjmerelo answered Oct 25 '22 22:10