Why does the grammar below fail to parse Unicode characters?
It parses fine after removing the word boundaries from around <sym>.
#!/usr/bin/env perl6
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<✓> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('✓'); # Nil
From the « and » "left and right word boundary" doc:
[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.
✓ isn't a word character, so when « is tried at the start of '✓' there is no word character to its right and the word boundary assertion fails.
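The point above can be checked directly. This is a minimal sketch, assuming a Rakudo recent enough to provide .uniprop (which with no argument reports the Unicode general category); the grammar name G2 is made up for illustration and simply drops the word boundaries from the ✓ alternative, which the question reports as parsing fine.

```raku
# 'y' is a lowercase letter (Ll); '✓' (U+2713 CHECK MARK) is a symbol (So).
say 'y'.uniprop;            # Ll
say '✓'.uniprop;            # So

# Only the letter counts as a word character.
say so 'y' ~~ / \w /;       # True
say so '✓' ~~ / \w /;       # False

# « needs a word character on its right, so it cannot match before '✓'.
say so 'y' ~~ / « 'y' /;    # True
say so '✓' ~~ / « '✓' /;    # False

# Hypothetical variant of the grammar: keep « » for the word-like
# alternative and drop them where the symbol is not a word character.
grammar G2 {
    proto rule TOP {*}
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<✓> { <.sym> }
}
say G2.parse('y');
say G2.parse('✓');
```

Whether dropping the boundary is acceptable depends on what the surrounding grammar needs; it is shown only because the question mentions it as the change that makes the parse succeed.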
"Word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:
- Characters whose Unicode general category starts with an L, which stands for Letter. [1]
- Characters whose Unicode general category is Nd, which stands for Number, decimal. [2]
- _, an underscore.
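The three parts of that definition can be spelled out as an explicit character class and compared with \w. This is only a sketch: the variable name $wordish is made up for illustration, and it assumes the definition above is what the running Rakudo implements, in which case the two columns should agree for every character tried.

```raku
# Hand-built class: any Letter (L*), any decimal digit (Nd), or underscore.
my $wordish = / ^ <:L + :Nd + [_]> $ /;

for 'y', 'Y', 'ら', '7', '१', '_', '✓', '¹', '-' -> $c {
    my $builtin = so $c ~~ / ^ \w $ /;   # what \w says
    my $manual  = so $c ~~ $wordish;     # what the explicit class says
    say "$c  builtin=$builtin  manual=$manual";
}
```

Any row where the two values differ would point to a gap between the description and the implementation in use.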
In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".
But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/). [2]
This leads naturally to alphaNdunder (alpha Nd under), pronounced "alpha 'nd under".
[1] Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes, but also others including Lu (Letter, uppercase) and Lo (Letter, other), the latter of which includes the ら character JJ also mentions. There are other letter sub-categories too.
[2] Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people, their native languages use १ to denote one, and १ is included in the Nd category for decimal digits. But for another billion+ people, their native languages use 一 for one, and it is excluded from the Nd category (it is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
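The examples in this footnote can be inspected one by one. A small sketch, again assuming .uniprop with no argument returns the general category; the variable names are illustrative only.

```raku
# Report each character's general category and whether /<:Nd>/ matches it.
for '1', '¹', '①', '१', '६', '一', '六' -> $c {
    my $cat   = $c.uniprop;
    my $is-nd = so $c ~~ / <:Nd> /;
    say "$c  category=$cat  Nd=$is-nd";
}
# 1  category=Nd  Nd=True
# ¹  category=No  Nd=False
# ①  category=No  Nd=False
# १  category=Nd  Nd=True
# ६  category=Nd  Nd=True
# 一  category=Lo  Nd=False
# 六  category=Lo  Nd=False
```

Note that 一 and 六 still match \w under the definition above, because Lo is one of the letter categories; they are just not decimal digits.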