
Why do I need to escape Unicode in Java source files?

Please note that I'm not asking how but why. I also don't know whether this is an RCP-specific problem or something inherent to Java.

My Java source files are encoded in UTF-8.

If I define my string literals like this:

    new Language("fr", "Français"),
    new Language("zh", "中文")

It works as I expect when I use the string in the application by launching it from Eclipse as an Eclipse application.

But it fails when I launch the .exe built by the "Eclipse Product Export Wizard".

The solution I use is to escape the characters like this:

    new Language("fr", "Fran\u00e7ais"), // Français
    new Language("zh", "\u4e2d\u6587") // 中文

There is no problem in doing this (all my other strings are in properties files; only the language names are hardcoded), but I'd like to understand.
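
For illustration, here is a minimal sketch of what the escaped form amounts to. The source below contains only ASCII characters, so the bytes of the file read the same under any ASCII-compatible encoding; the compiler turns each escape into one UTF-16 code unit. The class name `EscapeDemo` and the helper `hexUnits` are made up for this example and are not part of my application.

```java
public class EscapeDemo {
    // Render each UTF-16 code unit of s as lowercase hex, separated by spaces.
    public static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both literals are written with Unicode escapes only (pure ASCII source).
        String fr = "Fran\u00e7ais";
        String zh = "\u4e2d\u6587";

        // Each escape becomes exactly one code unit in the compiled string.
        System.out.println(fr.length() + " code units: " + hexUnits(fr));
        System.out.println(zh.length() + " code units: " + hexUnits(zh));
    }
}
```

Run as is, this should report 8 code units for the French name and 2 for the Chinese one, whatever encoding the compiler assumes for the file.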

I thought the compiler had to convert the Java string literals when building the bytecode. So why is the Unicode escaping necessary? Is it wrong to use high-range Unicode characters in Java source files? What exactly happens to those characters at compilation, and how does that differ from the handling of escaped characters? Or is the problem just related to the RCP cache?

Denys Séguret asked Jun 27 '12 13:06



1 Answer

It appears that the Eclipse Product Export Wizard is not interpreting your files as UTF-8. Perhaps you need to run Eclipse's JVM with the encoding set to UTF-8 (-Dfile.encoding=UTF8 in eclipse.ini)?
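
To see what "not interpreting your files as UTF-8" does to a literal, here is a rough simulation. It takes the UTF-8 bytes that the source file would contain and decodes them with a different charset, much as a compiler would if it were told (or defaulted to) the wrong source encoding. ISO-8859-1 is only a stand-in: the question does not say which encoding the export build actually used. The class `SourceEncodingDemo` and the helper `misdecode` are hypothetical names for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingDemo {
    // Encode text as UTF-8 (as it sits in the source file), then decode those
    // bytes with a different charset, as a misconfigured compiler would.
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        String intended = "Fran\u00e7ais";
        // Assumption: the build reads the file as ISO-8859-1 instead of UTF-8.
        String garbled = misdecode(intended, StandardCharsets.ISO_8859_1);

        // The two UTF-8 bytes of U+00E7 (0xC3 0xA7) become two separate chars.
        System.out.println("intended length: " + intended.length());
        System.out.println("garbled length:  " + garbled.length());
        System.out.println("still equal?     " + intended.equals(garbled));

        // ASCII-only text, such as an escaped literal in source, is unaffected.
        System.out.println("ASCII unchanged? "
                + "fr".equals(misdecode("fr", StandardCharsets.ISO_8859_1)));

        // The charset this JVM uses when none is specified.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

This would explain why the escaped literals survive the export: they are plain ASCII in the file, so a wrong source encoding cannot corrupt them. When invoking `javac` directly, the source encoding can be stated explicitly with its `-encoding` option (for example `-encoding UTF-8`).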

(Copypasta'd at the OP's request)

Matt Ball answered Oct 07 '22 19:10