Haskell source encoding

Question

The Haskell 2010 Language Report says:

Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.

Does this mean UTF-8?

In ghc-7.0.4/compiler/parser/Lexer.x.source:

$unispace    = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar   = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab         = \t

$ascdigit  = 0-9
$unidigit  = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit  = $ascdigit -- for now, should really be $digit (ToDo)
$digit     = [$ascdigit $unidigit]

$special   = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\ \^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol    = [$ascsymbol $unisymbol] # [$special \_\:\"\']

$unilarge  = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge  = [A-Z]
$large     = [$asclarge $unilarge]

$unismall  = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall  = [a-z]
$small     = [$ascsmall $unismall \_]

$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic   = [$small $large $symbol $digit $special $unigraphic \:\"\']

I'm not sure what to make of this, and reading alexGetChar didn't really help.

hammar · Accepted Answer

There was a proposal to standardize on UTF-8 as the encoding of Haskell source files, but I'm not sure whether it was accepted.

In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
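
As for the Lexer.x excerpt in the question: the placeholder values \x01 to \x06 suggest that the lexer decodes each character first and then hands Alex a single stand-in value per Unicode class, so the generated tables never have to cover all of Unicode. Here is a minimal sketch of that idea. It is not GHC's actual alexGetChar; the name classify is hypothetical, and the mapping from general categories to classes is a simplified assumption.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical helper: collapse an already-decoded character into the
-- stand-in value that the Alex macros in the question match on.
classify :: Char -> Char
classify c
  | c <= '\x06' = '\x00'  -- real input must not collide with the placeholders
  | c <= '\x7f' = c       -- ASCII passes through unchanged
  | otherwise = case generalCategory c of
      UppercaseLetter -> '\x01'  -- $unilarge
      TitlecaseLetter -> '\x01'
      LowercaseLetter -> '\x02'  -- $unismall
      OtherLetter     -> '\x02'
      DecimalNumber   -> '\x03'  -- $unidigit
      MathSymbol      -> '\x04'  -- $unisymbol
      CurrencySymbol  -> '\x04'
      ModifierSymbol  -> '\x04'
      OtherSymbol     -> '\x04'
      Space           -> '\x05'  -- $unispace
      _               -> '\x06'  -- $unigraphic (simplification: everything else)

main :: IO ()
main = mapM_ (print . classify) "a\x3bb\x39b\x2192"
```

With something like this in front of it, the lexer only ever sees values below 0x80, whatever encoding the file on disk used.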

Ivan Danilov · Answer

Unicode is a character set. UTF-8, UTF-16, etc. are concrete encodings of Unicode code points as bytes. Try reading here; the difference is explained pretty well there.

The cited part of the report only states that Haskell sources use the Unicode character set. It doesn't say which encoding should be used. In other words, it says which characters may appear in source files, but not how they are written in terms of plain bytes.
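
To make the distinction concrete, here is a small sketch that encodes the same code point under two encodings. The function names utf8 and utf16be are my own, and the code does not reject surrogate code points, so treat it as an illustration and not a validating encoder.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point as UTF-8 (1 to 4 bytes).
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)                    -- leading payload bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F) -- continuation byte

-- Encode one code point as big-endian UTF-16 (2 or 4 bytes).
utf16be :: Char -> [Word8]
utf16be c
  | n < 0x10000 = unit n
  | otherwise   = unit (0xD800 + (m `shiftR` 10)) ++ unit (0xDC00 + (m .&. 0x3FF))
  where
    n      = ord c
    m      = n - 0x10000
    unit u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]

main :: IO ()
main = do
  print (utf8 '\x3bb')     -- prints [206,187]
  print (utf16be '\x3bb')  -- prints [3,187]
```

The character is the same in both cases (U+03BB, Greek small lambda); only the bytes differ, and the report talks about the character, not the bytes.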

Tags: haskell, encoding

György Andrasek
