 

Why does the Java ecosystem use different character encodings throughout their software stack?

For instance, class files use Modified UTF-8 (MUTF-8, a close relative of CESU-8), but internally Java first used UCS-2 and now uses UTF-16. The Java Language Specification only requires a minimal conforming compiler to accept ASCII characters in source files.

What's the reason for these choices? Wouldn't it make more sense to use the same encoding throughout the Java ecosystem?

asked Jul 13 '10 by soc

2 Answers

ASCII for source files was chosen because, at the time, it wasn't considered reasonable to expect everyone to have text editors with full Unicode support. Things have improved since, but they still aren't perfect. The \uXXXX escape mechanism is essentially Java's equivalent of C's trigraphs. (When C was created, some keyboards didn't have curly braces, so you had to use trigraphs!)
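As a small illustration (a minimal sketch; the class name is invented), the compiler translates \uXXXX escapes in an early lexical phase, so an escaped literal denotes exactly the same string as the directly typed character:

    public class UnicodeEscapeDemo {
        public static void main(String[] args) {
            // \u00e9 is the escape for 'é'; it is translated before parsing,
            // so both literals below denote the same string.
            String fromEscape = "caf\u00e9";
            String typedDirectly = "café";
            System.out.println(fromEscape.equals(typedDirectly)); // true
        }
    }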

At the time Java was created, the class file format used UTF-8 and the runtime used UCS-2. Unicode had less than 64k codepoints, so 16 bits was enough. Later, when additional "planes" were added to Unicode, UCS-2 was replaced with the (pretty much) compatible UTF-16, and UTF-8 was replaced with CESU-8 (hence "Compatibility Encoding Scheme...").

In the class file format they wanted to use UTF-8 to save space. The design of the class file format (including the JVM instruction set) was heavily geared towards compactness.

In the runtime they wanted to use UCS-2 because it was felt that saving space was less important than avoiding the need to deal with variable-width characters. Unfortunately, this backfired once UCS-2 became UTF-16: a code point can now take multiple chars, and worse, the char datatype is now somewhat misnamed. In general it no longer corresponds to a character, only to a UTF-16 code unit.
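To make that concrete (a minimal sketch; the class name is made up), a supplementary-plane code point occupies two char values in a Java String, so length() and codePointCount() disagree:

    public class CodePointDemo {
        public static void main(String[] args) {
            // U+1F600 lies outside the old UCS-2 range, so UTF-16 stores it
            // as a surrogate pair: two char code units for one code point.
            String s = new String(Character.toChars(0x1F600));

            System.out.println(s.length());                              // 2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length()));         // 1 (actual code point)
            System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
            System.out.println(Integer.toHexString(s.codePointAt(0)));   // 1f600
        }
    }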

answered by Laurence Gonsalves


MUTF-8 for efficiency, UCS2 for hysterical raisins. :)

In 1993, UCS2 was Unicode; everyone thought 65536 Characters Ought To Be Enough For Everyone.

Later on, when it became clear that there really are an awful lot of languages in the world, it was too late (and would have been a terrible idea anyway) to redefine char as 32 bits, so instead a mostly backward-compatible choice was made: UTF-16.

In a way that's closely analogous to the relationship between ASCII and UTF-8, Java strings that don't stray outside the historical UCS-2 boundaries are bit-identical to their UTF-16 representation. It's only when you colour outside those lines that you have to start worrying about surrogates, etc.
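A short sketch of that boundary (class name invented for illustration): a string that stays inside the BMP has exactly as many code points as chars, while one that strays outside it has to be walked by code point rather than by char:

    public class SurrogateBoundaryDemo {
        public static void main(String[] args) {
            String bmpOnly = "hello"; // stays within the historical UCS-2 range
            String beyond = "hi" + new String(Character.toChars(0x1F600));

            // Inside the BMP, chars and code points coincide.
            System.out.println(bmpOnly.length() ==
                    bmpOnly.codePointCount(0, bmpOnly.length()));  // true

            // Outside it, iterate by code point; a naive charAt loop would
            // hand back the two surrogate halves separately.
            beyond.codePoints()
                  .forEach(cp -> System.out.println(Integer.toHexString(cp)));
            // prints 68, 69, 1f600
        }
    }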

answered by Alex Cruise