
Grammar and unicode characters

Tags:

raku

Why does the grammar below fail to parse Unicode characters?

It parses fine after removing the word boundaries around <.sym>.

#!/usr/bin/env perl6

grammar G {


  proto rule TOP { * }

  rule TOP:sym<y>  { «<.sym>» }
  rule TOP:sym<✓>  { «<.sym>» }

}

say G.parse('y'); # 「y」
say G.parse('✓'); # Nil
hythm asked Aug 16 '19 03:08


1 Answer

From the « and » "left and right word boundary" doc:

[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.

✓ isn't a word character. So the word boundary assertion fails.
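As a rough sketch of how to check this yourself, the snippet below tests both characters against \w and then tries a variant of the grammar with the boundary assertions dropped. The name G-no-boundary is made up for this example.

```raku
# Check whether each character counts as a "word" character (\w).
for 'y', '✓' -> $char {
    say "{$char.uniname}: {so $char ~~ /\w/}";
}

# Hypothetical variant of the question's grammar without « and ».
grammar G-no-boundary {
    proto rule TOP {*}

    rule TOP:sym<y> { <.sym> }
    rule TOP:sym<✓> { <.sym> }
}

say G-no-boundary.parse('y');
say G-no-boundary.parse('✓');
```

Without « and » nothing requires a word character next to the match position, so the check mark alternative is free to match.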

What is and isn't a "word character"

"Word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:

  • Characters whose Unicode general category starts with an L, which stands for Letter. [1]

  • Characters whose Unicode general category is Nd, which stands for Number, decimal. [2]

  • _, an underscore.
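The three rules above can be spelled out as an explicit character class and compared with \w. This is only a sketch: the sample characters and the variable names are chosen here for illustration.

```raku
# Compare \w with a class built from the three rules: L, Nd, underscore.
for 'a', 'Ж', '7', '_', '✓', ' ' -> $c {
    my $w       = so $c ~~ / ^ \w $ /;
    my $by-prop = so $c ~~ / ^ <:L + :Nd + [_]> $ /;
    say "{$c.raku}  \\w: $w  L or Nd or _: $by-prop";
}
```

For each sample character the two columns should agree if \w follows the definition given above.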

"alpha 'Nd under"

In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".

But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/). [2]

This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".

Footnotes

[1] Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the character JJ also mentions. There are other letter sub-categories too.

[2] Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one, and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one, and it is excluded from the Nd category (it is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
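To inspect the categories of digit-like characters yourself, something like the following sketch can help. The particular characters are picked here for illustration; uniprop with no argument reports the General_Category.

```raku
# Print the Unicode general category of several digit-like characters,
# and whether each one matches \w.
for '1', '¹', '①', '१', '一', '६', '六' -> $c {
    say "$c  category: {$c.uniprop}  \\w: {so $c ~~ /\w/}";
}
```

Characters reported as Nd are the decimal digits in the sense used here; the Han numerals show up under a letter category instead.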

raiph answered Oct 03 '22 11:10