Why does Java use modified UTF-8 rather than standard UTF-8 for object serialization and JNI?
One possible explanation is that modified UTF-8 can't have embedded null characters and therefore one can use functions that operate on null-terminated strings with it. Are there any other reasons?
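To make the embedded-null point concrete, here is a small sketch (the class and helper names are mine, not from the question): DataOutputStream.writeUTF uses modified UTF-8, which encodes U+0000 as the two-byte sequence C0 80, so the encoded bytes never contain 0x00 and can be handed to C functions that expect null-terminated strings.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class EmbeddedNullDemo {
    public static void main(String[] args) throws Exception {
        String s = "a\u0000b"; // a string with an embedded null character

        // Standard UTF-8: U+0000 becomes a single 0x00 byte -> 61 00 62
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8 (as written by DataOutputStream.writeUTF):
        // U+0000 becomes the overlong two-byte sequence C0 80, so the
        // encoded data never contains a 0x00 byte -> 00 04 61 C0 80 62
        // (the first two bytes are writeUTF's length prefix).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();

        System.out.println(toHex(standard));
        System.out.println(toHex(modified));
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }
}
```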
Note: Java encodes all Strings into UTF-16, which uses a minimum of two bytes to store code points. Why would we need to convert to UTF-8 then? Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive an ASCII-encoded String, which doesn't support as many characters as UTF-8.
UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.
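As a quick illustration of that backward compatibility (a sketch, not part of the original answer): for code points 0-127, US-ASCII and UTF-8 produce identical bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatDemo {
    public static void main(String[] args) {
        String s = "Hello"; // only code points in the 0-127 range
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);
        // Both encodings use one byte per character here, and the bytes match.
        System.out.println(Arrays.equals(ascii, utf8)); // true
    }
}
```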
So the encoding applied is indeed UTF-16, but the character set to which it is applied is a proper subset of the entire Unicode character set, and this guarantees that Java always uses two bytes per character in its internal String encoding. (Note: this is not correct for current Java versions.)
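A quick sketch of why the "two bytes per character" guarantee does not hold: a supplementary code point such as U+1F600 occupies two char values (a surrogate pair), and since Java 9 the JVM may also store Latin-1-only strings with one byte per character internally (compact strings).

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600 GRINNING FACE, a single supplementary code point
        System.out.println(s.length());                      // 2 -- two UTF-16 code units (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1 -- but only one Unicode code point
    }
}
```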
This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding. But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16.
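A minimal sketch of that mismatch (the example string is mine): decoding the bytes with the charset they were actually encoded in round-trips correctly, while claiming they are UTF-16 produces garbage.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // Correct: tell the constructor what the bytes actually are.
        String ok = new String(utf8Bytes, StandardCharsets.UTF_8);

        // Wrong: the bytes are UTF-8, but we claim they are UTF-16,
        // so the constructor pairs them up into meaningless code units.
        String garbled = new String(utf8Bytes, StandardCharsets.UTF_16);

        System.out.println(ok);      // héllo
        System.out.println(garbled); // mojibake
    }
}
```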
It is faster and simpler for handling supplementary characters (by not handling them).
Java represents characters as 16-bit chars, but Unicode has evolved to contain more than 64K characters, so some characters, the supplementary characters, have to be encoded in two chars (a surrogate pair) in Java.
Strict UTF-8 requires that the encoder convert surrogate pairs into characters and then encode those characters into bytes. The decoder needs to split supplementary characters back into surrogate pairs.
chars -> character -> bytes -> character -> chars
Since both ends are Java, we can take a shortcut and encode directly at the char level:
char -> bytes -> char
Neither the encoder nor the decoder needs to worry about surrogate pairs.
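A small sketch of the difference (the class name is mine): standard UTF-8 encodes the supplementary code point U+1F600 as a single 4-byte sequence, while DataOutputStream.writeUTF, which uses modified UTF-8, encodes each of its two surrogate chars as a separate 3-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class SupplementaryEncodingDemo {
    public static void main(String[] args) throws Exception {
        String s = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair in Java

        // Standard UTF-8: the code point is reassembled and encoded as F0 9F 98 80 (4 bytes).
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8: each surrogate char is encoded on its own as a
        // 3-byte sequence, ED A0 BD ED B8 80 (6 bytes after the 2-byte length prefix).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();

        System.out.println(standard.length); // 4
        System.out.println(modified.length); // 8 = 2-byte length prefix + 6 data bytes
    }
}
```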
I suspect that's the main reason. In C land, having to deal with strings that can contain embedded NULs would complicate things.