
Printing Unicode from Scala interpreter

When using the Scala interpreter (i.e. running the command 'scala' on the command line), I am not able to print Unicode characters correctly. Characters such as a-z and A-Z are printed correctly, but € or ƒ, for example, is printed as a ?.

print(8364.toChar)

results in ? instead of €. I am probably doing something wrong. My terminal supports UTF-8 characters, and even when I pipe the output to a separate file and open it in a text editor, ? is displayed.

This is all happening on Mac OS X (Snow Leopard, 10.6.2) with Scala 2.8 (nightly build) and Java 1.6.0_17.

Martin Sturm asked Dec 22 '09 17:12



2 Answers

I found the cause of the problem and a solution that makes it work as it should. As I suspected after posting my question, reading Calum's answer, and running into encoding issues on the Mac in another project (which was in Java), the cause is the default encoding used on Mac OS X. When you start the Scala interpreter, it uses the platform's default encoding. On Mac OS X this is MacRoman; on Windows it is probably CP1252. You can check this by typing the following command in the Scala interpreter:

scala> System.getProperty("file.encoding");
res3: java.lang.String = MacRoman
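
To see why the default encoding matters, the following sketch encodes the same character under two charsets and prints the resulting bytes. ISO-8859-1 is used here only as a stand-in for a non-UTF-8 encoding that cannot represent €; it is not MacRoman, and the names EncodingDemo and hexBytes are made up for illustration.

```scala
object EncodingDemo {
  // Hex dump of the bytes a string becomes under the named charset.
  def hexBytes(s: String, charsetName: String): String =
    s.getBytes(charsetName).map(b => "%02X".format(b & 0xFF)).mkString(" ")

  def main(args: Array[String]): Unit = {
    // The euro sign, written as an escape so the source file encoding does not matter.
    val euro = "\u20AC"
    println("default charset: " + java.nio.charset.Charset.defaultCharset())
    println("UTF-8:      " + hexBytes(euro, "UTF-8"))      // E2 82 AC
    println("ISO-8859-1: " + hexBytes(euro, "ISO-8859-1")) // 3F, the byte for '?'
  }
}
```

A charset that has no mapping for a character substitutes its replacement byte, which is one way a ? can end up in the output. Even when the bytes are valid in the JVM's encoding, a UTF-8 terminal may still fail to display them.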

According to the scala help text, it is possible to pass Java properties using the -D option. However, this did not work for me. I ended up setting the environment variable

JAVA_OPTS="-Dfile.encoding=UTF-8"

After starting scala with this variable set, the previous command gives the following result:

scala> System.getProperty("file.encoding")
res0: java.lang.String = UTF-8

Now, printing special characters works as expected:

scala> print(0x20AC.toChar)
€
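
If changing JAVA_OPTS is not possible, one alternative might be to wrap System.out in a PrintStream with an explicit encoding, so that the output no longer depends on file.encoding. This is only a sketch: the names Utf8Print and utf8Stream are made up for illustration, and it assumes the terminal itself expects UTF-8.

```scala
import java.io.{OutputStream, PrintStream}

object Utf8Print {
  // Wrap any output stream so that characters are always encoded as UTF-8,
  // regardless of the JVM's default encoding. The 'true' enables auto-flush.
  def utf8Stream(out: OutputStream): PrintStream =
    new PrintStream(out, true, "UTF-8")

  def main(args: Array[String]): Unit = {
    val out = utf8Stream(System.out)
    out.println(0x20AC.toChar) // euro sign
    out.println(402.toChar)    // the ƒ character from the question
  }
}
```

This only affects text written through the wrapped stream; plain print and println calls still go through the default encoding.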

So it is not a bug in Scala, but an issue with default encodings. In my opinion, it would be better if UTF-8 were the default on all platforms. While looking into whether this is being considered, I came across a discussion on the Scala mailing list about this issue. The first message proposes using UTF-8 by default on Mac OS X when file.encoding reports MacRoman, since UTF-8 is the default charset on Mac OS X (which leaves me wondering why file.encoding is set to MacRoman by default; probably this is inherited from Mac OS before version 10 was released?). I don't think this proposal will be part of Scala 2.8, since Martin Odersky wrote that it is probably best to keep things as they are in Java (i.e. honor the file.encoding property).

Martin Sturm answered Sep 30 '22 08:09



Ok, at least part, if not all, of your problem here is that 128 is not the Unicode codepoint for Euro. 128 (or 0x80 since hex seems to be the norm) is U+0080 <control>, i.e. it is not a printable character, so it's not surprising your terminal is having trouble printing it.

Euro's codepoint is 0x20AC (or in decimal 8364), and that appears to work for me (I'm on Linux, on a nightly of 2.8):

scala> print(0x20AC.toChar)
€

Another fun test is to print the Unicode snowman character:

scala> print(0x2603.toChar)
☃

128 as € is apparently an extended character from one of the Windows code pages.
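
The difference between the code point 128 and the byte 0x80 in a Windows code page can be checked with a short sketch. It assumes the JVM ships the windows-1252 charset, which is common but not guaranteed, so the lookup is guarded; the names CodePageDemo and decodeByte are made up for illustration.

```scala
import java.nio.charset.Charset

object CodePageDemo {
  // Decode a single byte under the named charset.
  // Returns None if this JVM does not provide that charset.
  def decodeByte(b: Int, charsetName: String): Option[String] =
    if (Charset.isSupported(charsetName))
      Some(new String(Array(b.toByte), charsetName))
    else None

  def main(args: Array[String]): Unit = {
    // As a Unicode code point, 128 is the control character U+0080.
    println("128.toChar is U+%04X".format(128.toChar.toInt))
    // As a byte in windows-1252, 0x80 maps to the euro sign, U+20AC.
    decodeByte(0x80, "windows-1252") match {
      case Some(s) => println("byte 0x80 decodes to U+%04X".format(s.charAt(0).toInt))
      case None    => println("windows-1252 is not available on this JVM")
    }
  }
}
```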

I got the other character you mentioned to work too:

scala> 'ƒ'.toInt
res8: Int = 402

scala> 402.toChar
res9: Char = ƒ
Calum answered Sep 30 '22 10:09
