
How does Java 16 bit chars support Unicode?

Tags:

java

unicode

Java's char is 16 bits, yet Unicode has far more characters - how does Java deal with that?

asked Dec 21 '09 by leeeroy


People also ask

How does Unicode work in Java?

Unicode escape sequences can be used anywhere in Java source code: in identifiers, comments, character literals, and string literals. As long as an identifier consists of valid Unicode letters and digits, the compiler accepts it. Note, however, that Unicode escapes are translated by the compiler very early, before the rest of the source is parsed.
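For example (a minimal sketch, not from the quoted text; the class name is just for illustration), the escape \u00E9 works both in an identifier and in a string literal, because the compiler translates it before parsing:

public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // \u00E9 is the escape for 'é'; the compiler replaces it before parsing,
        // so it is legal both inside this identifier and inside the literal.
        String caf\u00E9 = "caf\u00E9";
        System.out.println(caf\u00E9);  // prints: café
    }
}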

Is a 16-bit Unicode character?

Unicode has both an 8-bit and a 16-bit encoding form (UTF-8 and UTF-16), chosen according to the data being encoded. In the 16-bit form, which Java uses internally, each code unit is 16 bits (2 bytes) wide. Code points are usually written as U+hhhh, where hhhh is the character's hexadecimal value.
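As a rough illustration in Java (a sketch, not part of the quoted answer; the class name is just for illustration), the same one-character string takes one byte in the 8-bit form and two bytes in the 16-bit form:

import java.nio.charset.StandardCharsets;

public class EncodingForms {
    public static void main(String[] args) {
        String s = "A";  // U+0041
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);     // 1 byte:  0x41
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);  // 2 bytes: 0x00 0x41
        System.out.println(utf8.length + " bytes vs " + utf16.length + " bytes");
    }
}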

Does Java char use ASCII or Unicode?

Java actually uses Unicode, which includes ASCII and other characters from languages around the world.
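As a quick illustration (a sketch, not part of the quoted answer), a char can hold non-ASCII code points directly:

char ascii  = 'A';       // U+0041, within the ASCII range
char beyond = '\u00E9';  // U+00E9 ('é'), outside ASCII but still a single 16-bit char
System.out.println((int) ascii + " " + (int) beyond);  // prints: 65 233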

Why does Java use the Unicode character set?

In older character sets, the same code value could represent different characters in different languages. To overcome this shortcoming, the Unicode system was developed, in which each character is assigned a unique code point (represented in Java by 16-bit char values). Because Java was designed to be multilingual, it adopted the Unicode system.


2 Answers

http://en.wikipedia.org/wiki/UTF-16

In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps each character to a sequence of 16-bit words. Characters are known as code points and the 16-bit words are known as code units. For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.
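Seen from Java (an illustrative sketch, not part of the quoted article): a BMP character fits in one char, while a supplementary character such as U+1D50A needs a surrogate pair of two chars:

String bmp  = "A";             // U+0041, in the Basic Multilingual Plane
String supp = "\uD835\uDD0A";  // U+1D50A, stored as a surrogate pair

System.out.println(bmp.length());                              // 1 code unit
System.out.println(supp.length());                             // 2 code units
System.out.println(supp.codePointCount(0, supp.length()));     // 1 code point
System.out.println(Character.isHighSurrogate(supp.charAt(0))); // true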

answered Oct 12 '22 by Amber


Java Strings are UTF-16 (big endian), so a Unicode code point can be one or two char values. Under this encoding, Java can represent the code point U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G) using the chars 0xD835 0xDD0A (String literal "\uD835\uDD0A"). The Character class provides methods for converting to/from code points.

// Unicode code point to char array
char[] math_fraktur_cap_g = Character.toChars(0x1D50A);
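A possible round trip building on that snippet (an illustrative sketch, not part of the original answer):

char[] mathFrakturCapG = Character.toChars(0x1D50A);  // {0xD835, 0xDD0A}
String s = new String(mathFrakturCapG);               // same as "\uD835\uDD0A"
int codePoint = s.codePointAt(0);                     // back to 0x1D50A
System.out.println(Integer.toHexString(codePoint));   // 1d50a
System.out.println(Character.charCount(codePoint));   // 2 chars needed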
answered Oct 13 '22 by McDowell