
Range of unicode characters GHC accepts

This may sound a bit ridiculous, but GHC fails to compile my string containing bacon, a croissant, cucumber, and a potato:

main = putStrLn "🥓  🥐  🥒  🥔"

I realize I could easily write

main = putStrLn "\x1F953  \x1F950  \x1F952  \x1F954"

to the same effect, but I had always assumed GHC would accept any Unicode in its source. So: what are the actual restrictions on the Unicode characters GHC accepts in source files?
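To make the "same effect" claim concrete, here is a minimal sketch (the names `escaped` and `fromCodePoints` are made up for illustration) that checks the escaped string against one built from numeric code points. The source contains only ASCII, so it should not depend on which emoji the lexer accepts. It prints code points rather than the string itself, to avoid depending on the terminal's encoding.

```haskell
import Data.Char (chr, ord)
import Data.List (intercalate)

-- The escaped string from the question; only ASCII appears in the source.
escaped :: String
escaped = "\x1F953  \x1F950  \x1F952  \x1F954"

-- The same text built from numeric code points, joined by two spaces.
fromCodePoints :: String
fromCodePoints =
  intercalate "  " (map (\cp -> [chr cp]) [0x1F953, 0x1F950, 0x1F952, 0x1F954])

main :: IO ()
main = do
  -- Decimal code points, the notation GHC uses in its error messages.
  print (map ord escaped)
  -- Whether both spellings denote the same string.
  print (escaped == fromCodePoints)
```

Note that `\x1F953` (hexadecimal) and `\129363` (decimal) are two spellings of the same character.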


BTW: I realize that supporting this sort of thing is hell for the GHC lexer (I actually ran across the above problem while writing test cases for a lexer I wrote), but I am still a tad disappointed.

Alec asked Jan 03 '17 07:01



1 Answer

Saving main = putStrLn "🥓 🥐 🥒 🥔" as UTF-8 and running it with GHC 8.0.1 on macOS, I got:

lexical error in string/character literal at character '\129365'

I found this related (but closed) GHC bug report, which says:

The cause (for both problems) was that older versions of GHC support a older version of Unicode:

$ ghc-7.0.3 -e "Data.Char.generalCategory '\8342'"
NotAssigned

So the problem seems to be that the version of GHC we're using doesn't support the newer emoji yet: it treats the code point as unassigned and errors out, even though newer versions of Unicode assign it to an emoji.
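One way to check this on a given installation is to ask `Data.Char.generalCategory` about the code points from the question, as in the bug report's one-liner. The sketch below is only illustrative: `foodCodePoints`, `isUnassigned` and `describe` are hypothetical names, and it assumes the lexer and `Data.Char` consult Unicode data of the same version, which I have not verified. The characters are written as numbers so the file itself compiles even where the literal emoji are rejected.

```haskell
import Data.Char (GeneralCategory (..), chr, generalCategory)
import Numeric (showHex)

-- Code points from the question: bacon, croissant, cucumber, potato.
foodCodePoints :: [Int]
foodCodePoints = [0x1F953, 0x1F950, 0x1F952, 0x1F954]

-- True when this base library's Unicode tables have no entry for the code point.
isUnassigned :: Int -> Bool
isUnassigned cp = generalCategory (chr cp) == NotAssigned

-- One line of output per code point, e.g. "U+41 -> UppercaseLetter" for 'A'.
describe :: Int -> String
describe cp = "U+" ++ showHex cp "" ++ " -> " ++ show (generalCategory (chr cp))

main :: IO ()
main = mapM_ (putStrLn . describe) foodCodePoints
```

If the explanation above is right, a GHC whose tables predate these emoji should report `NotAssigned` for all four, while a newer one should report a symbol category. The exact output depends on the installed version.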

There is also a related open GHC bug ticket, though it mostly discusses which whitespace characters are allowed.

Finally, the lit_error function in Lexer.x seems to be where the error is surfaced. Multiple functions in that file call lit_error, though, so I'm not sure exactly where it's coming from.

mb21 answered Oct 12 '22 02:10