
Should a Haskell parser allow Unicode digits in numeric literals?

As an exercise, I am writing a parser for Haskell from scratch. While writing the lexer, I noticed the following rules in the Haskell 2010 Report:

digit → ascDigit | uniDigit
ascDigit → 0 | 1 | … | 9
uniDigit → any Unicode decimal digit
octit → 0 | 1 | … | 7
hexit → digit | A | … | F | a | … | f

decimal → digit{digit}
octal → octit{octit}
hexadecimal → hexit{hexit}

integer → decimal | 0o octal | 0O octal | 0x hexadecimal | 0X hexadecimal
float → decimal . decimal [exponent] | decimal exponent
exponent → (e | E) [+ | -] decimal

Decimal and hexadecimal literals, along with float literals, are all based on digit, which admits any Unicode decimal digit, instead of ascDigit, which admits only the basic digits 0-9 from ASCII. Strangely, octal is based on octit, which instead only admits the ASCII digits 0-7. I would guess that these "Unicode decimal digit"s are any Unicode codepoints with the "Nd" General Category. However, this includes characters such as the Full-Width digits ０-９ and the Devanagari numerals ०-९. I can see why it might be desirable to allow these in identifiers, but I can see no benefit whatsoever for allowing one to write ९0 for the literal 90.
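For a from-scratch lexer, the two digit classes can be sketched with Data.Char. This assumes, as guessed above, that "Unicode decimal digit" means General Category Nd; the Report does not spell that out. The names isAscDigit, isUniDigit, isReportDigit and examples are hypothetical, chosen only for this illustration.

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory, isDigit)

-- ascDigit: Data.Char.isDigit selects only ASCII '0'..'9'.
isAscDigit :: Char -> Bool
isAscDigit = isDigit

-- uniDigit, under the ASSUMPTION that it means General Category Nd.
-- ASCII digits are excluded here so the two classes do not overlap.
isUniDigit :: Char -> Bool
isUniDigit c = generalCategory c == DecimalNumber && not (isAscDigit c)

-- digit, as the Report's grammar defines it: ascDigit | uniDigit.
isReportDigit :: Char -> Bool
isReportDigit c = isAscDigit c || isUniDigit c

-- ASCII nine, Full-Width nine (U+FF19), Devanagari nine (U+096F), and a letter,
-- each paired with its (ascDigit, digit) classification.
examples :: [(Char, Bool, Bool)]
examples =
  [ (c, isAscDigit c, isReportDigit c)
  | c <- ['9', '\xFF19', '\x096F', 'x']
  ]
```

Only the ASCII nine satisfies isAscDigit, while all three nines satisfy isReportDigit, which is exactly the gap between the two readings of the grammar.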

GHC seems to agree with me. When I try to compile this file,

module DigitTest where
x1 = １

it spits out this error.

digitTest1.hs:2:6: error: lexical error at character '\65297'
  |
2 | x1 = １
  |      ^

However, this file

module DigitTest where
x１ = 1

compiles just fine. Am I reading the language specification incorrectly? Is GHC's (sensible) behavior actually correct, or does it technically go against the specification in the Report? I can find no mention of this anywhere.

Ian Scherer asked Jan 26 '20 22:01


1 Answer

In the GHC source code file compiler/parser/Lexer.x, you can find the following code:

$ascdigit  = 0-9
$unidigit  = \x03 -- Trick Alex into handling Unicode. See [Unicode in Alex].
$decdigit  = $ascdigit -- for now, should really be $digit (ToDo)
$digit     = [$ascdigit $unidigit]
...
$binit     = 0-1
$octit     = 0-7
$hexit     = [$decdigit A-F a-f]
...
@numspc       = _*                   -- numeric spacer (#14473)
@decimal      = $decdigit(@numspc $decdigit)*
@binary       = $binit(@numspc $binit)*
@octal        = $octit(@numspc $octit)*
@hexadecimal  = $hexit(@numspc $hexit)*
@exponent     = @numspc [eE] [\-\+]? @decimal
@bin_exponent = @numspc [pP] [\-\+]? @decimal

Here, $decdigit is used for parsing decimal and hexadecimal literals (and their floating point variants), while $digit is used for the "numeric" part of alphanumeric identifiers. The "ToDo" note makes it clear that this is a recognized deviation of GHC from the language standard.
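To make the split concrete for a hand-written lexer, here is a minimal sketch of the same idea: one digit class for literals and a wider one for identifiers. It is not GHC's implementation. The names literalDigit, identDigit, lexVarId and lexDecimal are hypothetical, the identifier rule is heavily simplified, and treating "Unicode digit" as General Category Nd is an assumption.

```haskell
import Data.Char
  ( GeneralCategory (DecimalNumber)
  , generalCategory
  , isAlpha
  , isDigit
  , isLower
  )

-- Digit class for numeric literals, mirroring $decdigit: ASCII 0-9 only.
literalDigit :: Char -> Bool
literalDigit = isDigit

-- Digit class for identifiers, mirroring $digit: ASCII digits plus
-- (by ASSUMPTION) any character of General Category Nd.
identDigit :: Char -> Bool
identDigit c = isDigit c || generalCategory c == DecimalNumber

-- Simplified variable identifier: a lowercase letter or '_' followed by
-- letters, identifier digits, '_' or '\''. Returns (lexeme, remaining input).
lexVarId :: String -> Maybe (String, String)
lexVarId (c : cs)
  | isLower c || c == '_' =
      let (rest, remaining) = span identChar cs
       in Just (c : rest, remaining)
  where
    identChar x = isAlpha x || identDigit x || x == '_' || x == '\''
lexVarId _ = Nothing

-- Decimal literal without numeric spacers: one or more literal digits.
lexDecimal :: String -> Maybe (String, String)
lexDecimal s = case span literalDigit s of
  ("", _) -> Nothing
  (ds, rest) -> Just (ds, rest)
```

With these definitions, lexVarId accepts an identifier containing the Full-Width digit U+FF11, while lexDecimal rejects input that starts with that same character, which matches the behaviour the question observed. Making the lexer follow the Report instead would only require swapping literalDigit for identDigit inside lexDecimal.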

So, you're reading the spec correctly, and GHC is semi-intentionally violating the spec. There's an open ticket that suggests at least documenting the deviation, but I don't think anyone's expressed any interest in fixing it.

K. A. Buhr answered Nov 09 '22 02:11