As an exercise, I am writing a parser for Haskell from scratch. In making the lexer, I noticed the following rules on the Haskell 2010 Report:
digit       → ascDigit | uniDigit
ascDigit    → 0 | 1 | … | 9
uniDigit    → any Unicode decimal digit
octit       → 0 | 1 | … | 7
hexit       → digit | A | … | F | a | … | f

decimal     → digit{digit}
octal       → octit{octit}
hexadecimal → hexit{hexit}

integer     → decimal
            | 0o octal | 0O octal
            | 0x hexadecimal | 0X hexadecimal

float       → decimal . decimal [exponent]
            | decimal exponent

exponent    → (e | E) [+ | -] decimal
Decimal and hexadecimal literals, along with float literals, are all based on digit, which admits any Unicode decimal digit, rather than ascDigit, which admits only the ASCII digits 0-9. Strangely, octal is based on octit, which admits only the ASCII digits 0-7. I would guess that "any Unicode decimal digit" means any Unicode code point with the "Nd" (Decimal Number) general category. However, this includes characters such as the fullwidth digits ０-９ and the Devanagari digits ०-९. I can see why it might be desirable to allow these in identifiers, but I can see no benefit whatsoever in allowing one to write ९0 for the literal 90.
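To see which characters the Report's uniDigit would admit, you can query the Unicode general category with Data.Char. A quick sketch (the specific characters tested are just illustrative examples):

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory, isDigit)

main :: IO ()
main = do
  -- Both DEVANAGARI DIGIT NINE (U+096F) and FULLWIDTH DIGIT ONE (U+FF11)
  -- are in the Nd (DecimalNumber) general category, so the Report's
  -- uniDigit would admit them inside decimal literals.
  print (generalCategory '९')  -- DecimalNumber
  print (generalCategory '１') -- DecimalNumber
  -- Data.Char.isDigit, by contrast, accepts only ASCII '0'..'9',
  -- matching what GHC's lexer actually does for numeric literals.
  print (isDigit '९')          -- False
  print (isDigit '1')          -- True
```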
GHC seems to agree with me. When I try to compile this file (where the digit is '１', FULLWIDTH DIGIT ONE, U+FF11),

module DigitTest where
x1 = １

it spits out this error:

digitTest1.hs:2:6: error: lexical error at character '\65297'
  |
2 | x1 = １
  |      ^
However, this file
module DigitTest where
x1 = 1
compiles just fine. Am I reading the language specification incorrectly? Is GHC's (sensible) behavior actually correct, or does it technically go against the specification in the Report? I can find no mention of this anywhere.
In the GHC source code file compiler/parser/Lexer.x, you can find the following code:
$ascdigit = 0-9
$unidigit = \x03 -- Trick Alex into handling Unicode. See [Unicode in Alex].
$decdigit = $ascdigit -- for now, should really be $digit (ToDo)
$digit = [$ascdigit $unidigit]
...
$binit = 0-1
$octit = 0-7
$hexit = [$decdigit A-F a-f]
...
@numspc = _* -- numeric spacer (#14473)
@decimal = $decdigit(@numspc $decdigit)*
@binary = $binit(@numspc $binit)*
@octal = $octit(@numspc $octit)*
@hexadecimal = $hexit(@numspc $hexit)*
@exponent = @numspc [eE] [\-\+]? @decimal
@bin_exponent = @numspc [pP] [\-\+]? @decimal
Here, $decdigit is used for parsing decimal and hexadecimal literals (and their floating-point variants), while $digit is used for the "numeric" part of alphanumeric identifiers. The "ToDo" note makes it clear that this is a recognized deviation of GHC from the language standard.
So, you're reading the spec correctly, and GHC is semi-intentionally violating the spec. There's an open ticket that suggests at least documenting the deviation, but I don't think anyone's expressed any interest in fixing it.
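For comparison, a spec-faithful "decimal" lexer per the Report's grammar (digit = any Nd character) can be sketched in a few lines. The names specDigit and lexDecimal here are illustrative, not anything from GHC:

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory)

-- Accept any Unicode decimal digit, as the Report's "digit" production does.
specDigit :: Char -> Bool
specDigit c = generalCategory c == DecimalNumber

-- decimal → digit{digit}: take one or more digits off the front of the
-- input, returning the lexeme and the remainder, or Nothing on no match.
lexDecimal :: String -> Maybe (String, String)
lexDecimal s = case span specDigit s of
  ([], _)    -> Nothing
  (ds, rest) -> Just (ds, rest)

main :: IO ()
main = do
  print (lexDecimal "90rest")  -- Just ("90","rest")
  -- Per the spec this also accepts "९0", which GHC rejects:
  print (lexDecimal "९0rest")
  print (lexDecimal "abc")     -- Nothing
```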