This may sound a bit ridiculous, but GHC fails to compile my string containing bacon, a croissant, cucumber, and a potato:
main = putStrLn "🥓 🥐 🥒 🥔"
I realize I could easily write
main = putStrLn "\x1F953 \x1F950 \x1F952 \x1F954"
to the same effect, but I had always assumed GHC would accept any Unicode in its source. So: what are the actual restrictions on the Unicode characters GHC accepts in source files?
BTW: I realize that supporting this sort of thing is hell for the GHC lexer (actually I ran across the above problem while writing test cases for a lexer I wrote), but I still am a tad bit disappointed.
UTF-16 encoding: a supplementary character consists of two 16-bit values. The first 16-bit value is encoded in the range from 0xD800 to 0xDBFF. The second 16-bit value is encoded in the range from 0xDC00 to 0xDFFF. With supplementary characters, UTF-16 character codes can represent more than one million characters.
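To make the surrogate ranges concrete, here is a minimal sketch of how a supplementary code point could be split into its two 16-bit values. The name toSurrogates is made up for this illustration; it is not a function from base or from GHC.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Split a supplementary code point (above U+FFFF) into its UTF-16
-- high and low surrogates. Returns Nothing for characters that fit
-- in a single 16-bit value. (Hypothetical helper, for illustration.)
toSurrogates :: Char -> Maybe (Word16, Word16)
toSurrogates c
  | cp < 0x10000 = Nothing
  | otherwise = Just (high, low)
  where
    cp = ord c
    -- 20-bit offset into the supplementary range
    v = cp - 0x10000
    -- top 10 bits land in 0xD800..0xDBFF
    high = fromIntegral (0xD800 + (v `shiftR` 10))
    -- bottom 10 bits land in 0xDC00..0xDFFF
    low = fromIntegral (0xDC00 + (v .&. 0x3FF))
```

In GHCi, toSurrogates '\x1F953' (the bacon from the question) evaluates to Just (55358,56659), that is 0xD83E and 0xDD53, while toSurrogates 'a' gives Nothing.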
The maximum possible number of code points Unicode can support is 1,114,112 through seventeen 16-bit planes. Each plane can support 65,536 different code points. Among the more than one million code points that Unicode can support, version 4.0 currently defines 96,382 characters in planes 0, 1, 2, and 14.
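The arithmetic behind those numbers can be checked with a short sketch. The names planeSize, planeCount, totalCodePoints and planeOf are invented here for illustration only.

```haskell
import Data.Char (ord)

-- | Number of code points in one plane (2^16).
planeSize :: Int
planeSize = 0x10000

-- | Unicode has seventeen planes, numbered 0 to 16.
planeCount :: Int
planeCount = 17

-- | Total number of code points: 17 * 65,536 = 1,114,112.
totalCodePoints :: Int
totalCodePoints = planeCount * planeSize

-- | Plane number that a character's code point falls in.
planeOf :: Char -> Int
planeOf c = ord c `div` planeSize
```

The total agrees with the range of Haskell's Char type, since fromEnum (maxBound :: Char) + 1 is also 1114112. The emoji from the question all sit in plane 1: planeOf '\x1F953' is 1, whereas planeOf 'a' is 0.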
Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.
As Emacs supports editing files containing Unicode out of the box, so does Haskell Mode. As an add-on, Haskell Mode includes the haskell-unicode input method, which allows you to easily type a number of Unicode symbols that are useful when writing Haskell code; see (emacs)Input Methods for more details.
Saving
main = putStrLn "🥓 🥐 🥒 🥔"
as UTF-8 and running it with ghc 8.0.1 on macOS, I got:
lexical error in string/character literal at character '\129365'
I found this related (but closed) ghc bug report:
The cause (for both problems) was that older versions of GHC support an older version of Unicode:
$ ghc-7.0.3 -e "Data.Char.generalCategory '\8342'"
NotAssigned
So the problem seems to be that the version of GHC we're using doesn't support the newer emoji yet: it thinks the Unicode code point is unassigned and errors out, even though that code point is assigned to the emoji in newer versions of Unicode.
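If that diagnosis is right, you can ask your own GHC how it classifies the characters from the question. This is only a sketch: foods, isAssigned and report are names made up here, and the code points are written as escapes so the file lexes even on a GHC that rejects the literal emoji.

```haskell
import Data.Char (GeneralCategory (NotAssigned), generalCategory, ord, toUpper)
import Numeric (showHex)

-- | The four code points from the question, written as escapes.
foods :: [Char]
foods = "\x1F953\x1F950\x1F952\x1F954"

-- | True if this GHC's Unicode tables know the character.
isAssigned :: Char -> Bool
isAssigned c = generalCategory c /= NotAssigned

-- | One line per character: code point in hex, then its category.
report :: Char -> String
report c =
  "U+" ++ map toUpper (showHex (ord c) "") ++ " " ++ show (generalCategory c)
```

In GHCi, mapM_ (putStrLn . report) foods prints one line per character. The categories you see depend on the Unicode version your GHC was built with; any character reported as NotAssigned is a likely candidate for the lexical error above when it appears literally in a string.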
There is also a related open GHC bug ticket, though it mostly discusses which whitespace characters are allowed.
Finally, the lit_error function in Lexer.x seems to be where the error is surfaced. Several functions in that file call it, though, so I'm not sure exactly where it's coming from.