Why does Java use modified UTF-8 instead of UTF-8? [closed]

Why does Java use modified UTF-8 rather than standard UTF-8 for object serialization and JNI?

One possible explanation is that modified UTF-8 can't have embedded null characters and therefore one can use functions that operate on null-terminated strings with it. Are there any other reasons?

asked Mar 15 '13 by vitaut


2 Answers

It is faster and simpler for handling supplementary characters (by not handling them).

Java represents characters as 16-bit chars, but Unicode has evolved to contain more than 64K characters. So some characters, the supplementary characters, have to be encoded as 2 chars (a surrogate pair) in Java.

Strict UTF-8 requires that the encoder combine surrogate pairs into characters before encoding the characters into bytes, and the decoder needs to split supplementary characters back into surrogate pairs:

chars -> character -> bytes -> character -> chars

Since both ends are Java, we can take a shortcut and encode directly at the char level:

char -> bytes -> char

Neither the encoder nor the decoder needs to worry about surrogate pairs.
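To make the difference concrete, here is a small sketch (my own illustration, not the serialization code itself) that encodes one supplementary character both ways. String.getBytes uses strict UTF-8 and produces a single 4-byte sequence, while DataOutputStream.writeUTF uses modified UTF-8 and simply encodes each surrogate char as its own 3-byte sequence (after a 2-byte length prefix):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SupplementaryDemo {
    public static void main(String[] args) throws IOException {
        // U+1F600, a supplementary character stored as a surrogate pair in Java
        String s = "\uD83D\uDE00";

        // Strict UTF-8: the pair is combined into one code point -> one 4-byte sequence
        printHex(s.getBytes(StandardCharsets.UTF_8));   // F0 9F 98 80

        // Modified UTF-8: each surrogate char is encoded on its own -> 3 + 3 bytes
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);                            // 2-byte length prefix comes first
        }
        printHex(buf.toByteArray());                    // 00 06 ED A0 BD ED B8 80
    }

    static void printHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        System.out.println(sb.toString().trim());
    }
}

The encoder and decoder can treat each char independently, which is exactly the shortcut described above.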

answered Oct 11 '22 by ZhongYu


I suspect that's the main reason. In C land, having to deal with strings that can contain embedded NULs would complicate things.
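A quick way to see this property (again just an illustrative sketch): DataOutputStream.writeUTF uses modified UTF-8, which encodes U+0000 as the two-byte sequence C0 80, so the encoded bytes never contain a 0x00 that a null-terminated C string would stop at:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class EmbeddedNullDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            // writeUTF emits a 2-byte length followed by the modified UTF-8 bytes
            out.writeUTF("a\u0000b");
        }
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
        // Prints: 00 04 61 C0 80 62
        // Strict UTF-8 would encode U+0000 as a single 0x00 byte (61 00 62),
        // which would terminate a C string early; modified UTF-8 avoids that.
    }
}

The same property is why JNI's GetStringUTFChars can hand C code a pointer that works as an ordinary null-terminated string.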

answered Oct 11 '22 by NPE