Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell source encoding

The Haskell 2010 Language Report says:

Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.

Does this mean UTF-8?

In ghc-7.0.4/compiler/parser/Lexer.x.source:

$unispace    = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar   = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab         = \t

$ascdigit  = 0-9
$unidigit  = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit  = $ascdigit -- for now, should really be $digit (ToDo)
$digit     = [$ascdigit $unidigit]

$special   = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol    = [$ascsymbol $unisymbol] # [$special \_\:\"\']

$unilarge  = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge  = [A-Z]
$large     = [$asclarge $unilarge]

$unismall  = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall  = [a-z]
$small     = [$ascsmall $unismall \_]

$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic   = [$small $large $symbol $digit $special $unigraphic \:\"\']

...I'm not sure what to make of this. alexGetChar wasn't really helpful.

like image 642
György Andrasek Avatar asked Jul 23 '11 02:07

György Andrasek


2 Answers

There was a proposal to standardize on UTF-8 as the standard encoding of Haskell source files, but I'm not sure if it was accepted or not.

In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.

like image 189
hammar Avatar answered Oct 28 '22 22:10

hammar


Unicode is character set. UTF-8, UTF-16 etc are the concrete physical encodings of Unicode codepoints. Try to read here. The difference explained pretty well there.

Cited report's part just states that Haskell sources use Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters could appear in the sources, but doesn't say how they could be written in term of plain bytes.

like image 28
Ivan Danilov Avatar answered Oct 28 '22 22:10

Ivan Danilov