Should a Haskell parser allow Unicode digits in numeric literals?

Question

As an exercise, I am writing a parser for Haskell from scratch. While writing the lexer, I noticed the following rules in the Haskell 2010 Report:

digit → ascDigit | uniDigit
ascDigit → 0 | 1 | … | 9
uniDigit → any Unicode decimal digit
octit → 0 | 1 | … | 7
hexit → digit | A | … | F | a | … | f

decimal → digit{digit}
octal → octit{octit}
hexadecimal → hexit{hexit}

integer → decimal | 0o octal | 0O octal | 0x hexadecimal | 0X hexadecimal
float → decimal . decimal [exponent] | decimal exponent
exponent → (e | E) [+ | -] decimal

Decimal and hexadecimal literals, along with float literals, are all based on digit, which admits any Unicode decimal digit, rather than on ascDigit, which admits only the ASCII digits 0-9. Strangely, octal is based on octit, which admits only the ASCII digits 0-7. I would guess that a "Unicode decimal digit" is any Unicode code point with the General Category "Nd". However, this includes characters such as the fullwidth digits ０-９ and the Devanagari numerals ०-९. I can see why it might be desirable to allow these in identifiers, but I can see no benefit whatsoever in allowing one to write ९０ for the literal 90.
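To make the distinction concrete, here is a minimal sketch of the two character classes using Data.Char from base. It assumes that "Unicode decimal digit" means General Category Nd, which is my guess and not something the Report states; the names isAscDigit and isReportDigit are made up for illustration.

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory, isDigit)

-- | ascDigit: only the ASCII digits 0-9.
-- Data.Char.isDigit is documented as selecting ASCII digits only.
isAscDigit :: Char -> Bool
isAscDigit = isDigit

-- | digit (ascDigit | uniDigit), ASSUMING that "Unicode decimal digit"
-- means General Category Nd. The ASCII digits are themselves Nd, so one
-- category test covers both alternatives.
isReportDigit :: Char -> Bool
isReportDigit c = generalCategory c == DecimalNumber

-- | Sample characters, written as escapes to avoid source-encoding issues:
-- ASCII '9', FULLWIDTH DIGIT ONE (U+FF11), DEVANAGARI DIGIT NINE (U+096F),
-- and SUPERSCRIPT TWO (U+00B2, category No, so not a decimal digit).
samples :: [Char]
samples = ['9', '\xFF11', '\x096F', '\x00B2']
```

Under that assumption, map isAscDigit samples accepts only the first character, while map isReportDigit samples accepts the first three and rejects the superscript.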

GHC seems to agree with me. When I try to compile this file,

module DigitTest where
x1 = １

it spits out this error.

digitTest1.hs:2:6: error: lexical error at character '\65297'
  |
2 | x1 = １
  |      ^

However, this file

module DigitTest where
x１ = 1

compiles just fine. Am I reading the language specification incorrectly? Is GHC's (sensible) behavior actually correct, or does it technically go against the specification in the Report? I can find no mention of this anywhere.

K. A. Buhr · Accepted Answer

In the GHC source code file compiler/parser/Lexer.x, you can find the following code:

$ascdigit  = 0-9
$unidigit  = \x03 -- Trick Alex into handling Unicode. See [Unicode in Alex].
$decdigit  = $ascdigit -- for now, should really be $digit (ToDo)
$digit     = [$ascdigit $unidigit]
...
$binit     = 0-1
$octit     = 0-7
$hexit     = [$decdigit A-F a-f]
...
@numspc       = _*                   -- numeric spacer (#14473)
@decimal      = $decdigit(@numspc $decdigit)*
@binary       = $binit(@numspc $binit)*
@octal        = $octit(@numspc $octit)*
@hexadecimal  = $hexit(@numspc $hexit)*
@exponent     = @numspc [eE] [\-\+]? @decimal
@bin_exponent = @numspc [pP] [\-\+]? @decimal

Here, $decdigit is used for parsing decimal and hexadecimal literals (and their floating point variants), while $digit is used for the "numeric" part of alphanumeric identifiers. The "ToDo" note makes it clear that this is a recognized deviation of GHC from the language standard.

So, you're reading the spec correctly, and GHC is semi-intentionally violating the spec. There's an open ticket that suggests at least documenting the deviation, but I don't think anyone's expressed any interest in fixing it.
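The split between $decdigit and $digit described above can be sketched as a toy lexer. This is not GHC's implementation: the Token type, lexToken, and the simplified identifier rule are all hypothetical, the numeric spacer and every other literal form are left out, and "Unicode digit" is assumed to mean General Category Nd.

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory, isAlpha, isDigit, isLower)

-- Hypothetical token type; only the two cases needed for the illustration.
data Token = TInteger String | TVarId String
  deriving (Eq, Show)

-- Plays the role of $digit: any character of category Nd (ASCII included).
isUniDigit :: Char -> Bool
isUniDigit c = generalCategory c == DecimalNumber

-- Simplified identifier tail: letters, Nd digits, underscore, prime.
isIdentTail :: Char -> Bool
isIdentTail c = isAlpha c || isUniDigit c || c == '_' || c == '\''

-- Lex one token from the front of the input.
-- Numeric literals use the ASCII-only isDigit (the role of $decdigit);
-- identifiers accept Unicode digits after the first character.
lexToken :: String -> Either String (Token, String)
lexToken [] = Left "unexpected end of input"
lexToken s@(c : cs)
  | isDigit c =
      let (ds, rest) = span isDigit s
       in Right (TInteger ds, rest)
  | isLower c || c == '_' =
      let (tl, rest) = span isIdentTail cs
       in Right (TVarId (c : tl), rest)
  | otherwise = Left ("lexical error at character " ++ show c)
```

With this sketch, lexToken "x\xFF11 = 1" yields a TVarId containing the fullwidth digit, while lexToken "\xFF11" is rejected with an error naming '\65297', which mirrors the two results in the question. A lexer that followed the Report to the letter would use the Nd test in the numeric branch as well.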

Tags: syntax, language-lawyer, haskell, literals

Ian Scherer
