Javas char is 16 bit, yet Unicode have far more characters - how does Java deal with that ?
Unicode sequences can be used everywhere in Java code. As long as it contains Unicode characters, it can be used as an identifier. You may use Unicode to convey comments, ids, character content, and string literals, as well as other information. However, note that they are interpreted by the compiler early.
Unicode uses two encoding forms: 8-bit and 16-bit, based on the data type of the data that is being that is being encoded. The default encoding form is 16-bit, where each character is 16 bits (2 bytes) wide. Sixteen-bit encoding form is usually shown as U+hhhh, where hhhh is the hexadecimal code point of the character.
Java actually uses Unicode, which includes ASCII and other characters from languages around the world.
An even same code may represent a different character in one language and may represent other characters in another language. To overcome above shortcoming, the unicode system was developed where each character is represented by 2 bytes. As Java was developed for multilingual languages it adopted the unicode system.
http://en.wikipedia.org/wiki/UTF-16
In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps each character to a sequence of 16-bit words. Characters are known as code points and the 16-bit words are known as code units. For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.
Java Strings are UTF-16 (big endian), so a Unicode code point can be one or two characters. Under this encoding, Java can represent the code point U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G) using the chars 0xD835 0xDD0A
(String literal "\uD835\uDD0A"
). The Character class provides methods for converting to/from code points.
// Unicode code point to char array
char[] math_fraktur_cap_g = Character.toChars(0x1D50A);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With