 

Why Java char uses UTF-16?


Recently I have read a lot about Unicode code points and how they evolved over time, and of course I also read http://www.joelonsoftware.com/articles/Unicode.html.

But one thing I couldn't find the real reason for is why Java uses UTF-16 for a char.

For example, suppose I had a string containing 1024 characters, all within the ASCII range. That means 1024 * 2 bytes, i.e. 2 KB of memory that the string will consume either way.

So if Java's base char were UTF-8, it would be just 1 KB of data. Even if the string contains some characters that need multiple bytes, for example 10 occurrences of "字" (3 bytes each in UTF-8), the memory consumption only grows a little: (1014 * 1 byte) + (10 * 3 bytes) = 1 KB + 20 bytes.

The comparison speaks for itself: roughly 1 KB + 20 bytes vs. 2 KB. My point isn't specifically about ASCII; my curiosity is why it isn't UTF-8, which handles multibyte characters just as well. UTF-16 looks like a waste of memory for any string that contains mostly single-byte characters.
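For what it's worth, the byte counts above can be checked by encoding the same strings with String.getBytes. This is only a rough sketch (the class name is just for illustration, and String.repeat needs Java 11+); it measures an encoded copy, not the in-memory char[], where every char occupies 2 bytes either way:

    import java.nio.charset.StandardCharsets;

    public class EncodedSizes {
        public static void main(String[] args) {
            String ascii = "a".repeat(1024);                    // 1024 ASCII characters
            String mixed = "a".repeat(1014) + "字".repeat(10);  // 10 of them replaced by "字"

            // UTF_16BE avoids the 2-byte BOM that the plain UTF_16 charset would prepend
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 1024
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 2048
            System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 1044 = 1 KB + 20 bytes
            System.out.println(mixed.getBytes(StandardCharsets.UTF_16BE).length); // 2048
        }
    }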

Is there any good reason behind this?

FZE asked Mar 26 '16

People also ask

Does Java use UTF-16?

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
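As a small illustration of that mapping (a sketch only; the class name is made up), Charset.encode turns a sequence of chars into bytes and decode goes the other way:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CharsetMapping {
        public static void main(String[] args) {
            Charset utf8 = StandardCharsets.UTF_8;

            // encode: sequence of UTF-16 code units (chars) -> sequence of bytes
            ByteBuffer bytes = utf8.encode(CharBuffer.wrap("字"));
            System.out.println(bytes.remaining());   // 3 bytes (E5 AD 97)

            // decode: sequence of bytes -> sequence of UTF-16 code units
            CharBuffer chars = utf8.decode(bytes);
            System.out.println(chars.length());      // 1 char (U+5B57)
        }
    }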

Why is UTF-16 used?

UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs. The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level.
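A minimal sketch of that code-unit-level behaviour, using U+1D11E (a character beyond U+FFFF) as the example:

    public class CodeUnitsVsCodePoints {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb";   // "a" + U+1D11E (stored as a surrogate pair) + "b"

            System.out.println(s.length());                       // 4 -- UTF-16 code units
            System.out.println(s.codePointCount(0, s.length()));  // 3 -- Unicode code points
            System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true -- charAt sees half a pair
        }
    }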

Why does Java use UTF-16?

Because it used to be UCS-2, which was a nice fixed-length 16-bit encoding. Of course, 16 bits turned out not to be enough, so UTF-16 was retrofitted on top. Here is a quote from the Unicode FAQ: Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts.

What is a UTF-16 character?

UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit elements. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.
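Java's Character class exposes this one-or-two-unit structure directly; a quick sketch (class name is illustrative only):

    public class OneOrTwoCodeUnits {
        public static void main(String[] args) {
            char[] bmp  = Character.toChars(0x5B57);   // '字', inside the BMP
            char[] supp = Character.toChars(0x1D11E);  // a code point beyond U+FFFF

            System.out.println(bmp.length);    // 1 -- a single 16-bit code unit
            System.out.println(supp.length);   // 2 -- a surrogate pair
            System.out.printf("%04X %04X%n", (int) supp[0], (int) supp[1]);  // D834 DD1E
        }
    }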


1 Answer

Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. That then left UTF-16 as the easiest natural progression beyond it.
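To make the random-access point concrete: with fixed-width 16-bit code units, charAt(i) is a constant-time array lookup, while finding the i-th character in UTF-8 means scanning from the start, because each code point occupies 1 to 4 bytes. The helper below is a hypothetical sketch (not library code) that assumes well-formed UTF-8:

    import java.nio.charset.StandardCharsets;

    public class Utf8Indexing {
        // Returns the byte offset of the n-th code point in well-formed UTF-8.
        // There is no way to jump straight to code point n; the walk is O(n).
        static int offsetOfNthCodePoint(byte[] utf8, int n) {
            int offset = 0;
            while (n-- > 0) {
                int lead = utf8[offset] & 0xFF;
                if      (lead < 0x80) offset += 1;   // 1-byte (ASCII) sequence
                else if (lead < 0xE0) offset += 2;   // 2-byte sequence
                else if (lead < 0xF0) offset += 3;   // 3-byte sequence
                else                  offset += 4;   // 4-byte sequence
            }
            return offset;
        }

        public static void main(String[] args) {
            byte[] utf8 = "ab字cd".getBytes(StandardCharsets.UTF_8);
            System.out.println(offsetOfNthCodePoint(utf8, 3));  // 5 -- "字" took 3 bytes
            // By contrast, "ab字cd".charAt(3) is a plain array access:
            // with fixed-width units the offset is simply 2 * index.
        }
    }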

nj_ answered Sep 17 '22