Grammar and unicode characters

Tags:

raku

Why does the Grammar below fail to parse Unicode characters?

It parses fine after removing the word boundaries from <sym>.

#!/usr/bin/env perl6

grammar G {

  proto rule TOP { * }

  rule TOP:sym<y>  { «<.sym>» }
  rule TOP:sym<✓>  { «<.sym>» }

}

say G.parse('y'); # ｢y｣
say G.parse('✓'); # Nil

asked Aug 16 '19 03:08

hythm

1 Answer

From the « and » "left and right word boundary" doc:

[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.

✓ isn't a word character. So the word boundary assertion fails.
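
A minimal sketch of that diagnosis, assuming a recent Rakudo. The grammar name G2 is made up for illustration; it is the question's grammar with the word-boundary assertions dropped from the ✓ alternative only, which is the change the question reports as working.

```raku
# '✓' (U+2713) has the general category So (Symbol, other), so \w does not match it.
say '✓'.uniprop;        # So
say so '✓' ~~ /\w/;     # False
say so 'y' ~~ /\w/;     # True

# G2 is a hypothetical variant of the question's grammar.
grammar G2 {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }  # 'y' is a word character, so « and » can match
    rule TOP:sym<✓> { <.sym> }    # no « or »: '✓' offers no word character to anchor on
}

say G2.parse('y');      # ｢y｣
say G2.parse('✓');      # ｢✓｣
```

Whether dropping the assertions is acceptable depends on what the boundaries were meant to guard against; for a symbol such as ✓ a different lookaround may be more appropriate.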

What is and isn't a "word character"

"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:

Characters whose Unicode general category starts with an L, which stands for Letter.¹
Characters whose Unicode general category is Nd, which stands for Number, decimal.²
_, an underscore.
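
The three cases in the list above can be checked directly. This is a small sketch; is-word-char is a hypothetical helper name, not a built-in, and the sample characters are chosen only for illustration.

```raku
# Hypothetical helper: True if the single character $char matches \w.
sub is-word-char(Str $char --> Bool) {
    so $char ~~ /^ \w $/
}

# Print each sample character, its Unicode general category, and the \w verdict.
for 'y', 'Y', 'ら', '7', '१', '_', '✓' -> $char {
    say "$char  {$char.uniprop}  {is-word-char($char)}";
}
# y  Ll  True
# Y  Lu  True
# ら  Lo  True
# 7  Nd  True
# १  Nd  True
# _  Pc  True
# ✓  So  False
```

The underscore is the odd one out: its category is Pc (Punctuation, connector), so it counts as a word character by special case rather than through the L or Nd categories.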

"alpha 'Nd under"

In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".

But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).²

This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".

Footnotes

¹ Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes, but also others including Lu (Letter, uppercase) and Lo (Letter, other), the latter of which includes the ら character JJ also mentions. There are other letter sub-categories too.

² Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people, their native languages use १ to denote one, and १ is included in the Nd category for decimal digits. For another billion+ people, their native languages use 一 for one, yet it is excluded from the Nd category (and is in the L category for letters instead). Similarly, ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
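
The digit examples in this footnote can be checked the same way. A short sketch; the output comments assume the general categories assigned by the Unicode Character Database.

```raku
# For each character: its general category, and whether <:Nd> and <:L> match it.
for '1', '¹', '①', '१', '६', '一', '六' -> $char {
    my $is-nd = so $char ~~ /<:Nd>/;
    my $is-l  = so $char ~~ /<:L>/;
    say "$char  {$char.uniprop}  Nd: $is-nd  L: $is-l";
}
# 1  Nd  Nd: True  L: False
# ¹  No  Nd: False  L: False
# ①  No  Nd: False  L: False
# १  Nd  Nd: True  L: False
# ६  Nd  Nd: True  L: False
# 一  Lo  Nd: False  L: True
# 六  Lo  Nd: False  L: True
```

Note that 一 and 六 still match \w, but via the Lo (Letter, other) category rather than as digits.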

answered Oct 03 '22 11:10

raiph
