Can Character represent all Unicode code points?

Since a Java char is 16 bits long, I am wondering how it can represent the full range of Unicode code points. It can only represent 65536 code points, is that right?

user705414 asked Dec 22 '22 03:12



2 Answers

Yes, a Java char is a UTF-16 code unit. If you need to represent Unicode characters outside the Basic Multilingual Plane, you need to use surrogate pairs within a java.lang.String. The String class provides various methods to work with full Unicode code points, such as codePointAt(index).
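As a minimal sketch of the difference between code units and code points, the snippet below builds a string containing U+1F600, a character outside the Basic Multilingual Plane. The class name CodePointDemo and the constant SMILEY are made up for illustration; the String and Character methods are standard.

```java
public class CodePointDemo {
    // Hypothetical example value: U+1F600 lies outside the BMP,
    // so it occupies two chars (a surrogate pair) inside a String.
    static final String SMILEY = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        String s = "a" + SMILEY;

        // length() counts UTF-16 code units, not characters.
        System.out.println(s.length());                            // 3

        // codePointCount() counts full Unicode code points.
        System.out.println(s.codePointCount(0, s.length()));       // 2

        // codePointAt() combines the surrogate pair starting at index 1.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
    }
}
```

Note that charAt(1) on the same string would return only the high surrogate, which is not a character on its own.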

From section 3.1 of the Java Language Specification:

The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range, (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding. A few APIs, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java platform provides methods to convert between the two representations.

See the Character docs for more information.
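The conversion between the two representations mentioned in the quoted specification can be sketched by hand and checked against the library. The class SurrogateMath and its encode method are hypothetical names used only for this illustration; Character.toChars and Character.toCodePoint are the standard methods that do the same job.

```java
import java.util.Arrays;

public class SurrogateMath {
    // Hand-written illustration: split a supplementary code point
    // (U+10000..U+10FFFF) into a UTF-16 high/low surrogate pair.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] manual = encode(cp);
        char[] library = Character.toChars(cp);

        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]);    // D83D DE00
        System.out.println(Arrays.equals(manual, library));                    // true
        System.out.println(Character.toCodePoint(manual[0], manual[1]) == cp); // true
    }
}
```

In real code there is little reason to do this arithmetic yourself; the sketch is only meant to show what the library methods compute.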

Jon Skeet answered Dec 24 '22 02:12



One char, which is an unsigned 16-bit value, can represent any code point up to 0xFFFF, but not supplementary characters, whose code points are larger. Java is best thought of as using UTF-16 encoding in char, so supplementary characters are actually represented as pairs of char, called surrogate pairs. While one char can't represent such a supplementary character, a String or char[] can, and the code-point methods on String and Character handle it.
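One way to walk a string by code point rather than by char is sketched below. The class CodePointWalk and the method codePointsOf are illustrative names, not part of the JDK; the loop relies only on String.codePointAt and Character.charCount.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Collect every code point in s, stepping over surrogate pairs as one unit.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            // charCount is 1 for BMP code points and 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x1F600)) + "B";
        for (int cp : codePointsOf(s)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

On Java 8 and later, s.codePoints() returns an IntStream with the same values, so the explicit loop is mainly useful when you also need the char index.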

Sean Owen answered Dec 24 '22 01:12
