
Java Unicode encoding

A Java char is 2 bytes (65,536 possible values), but there are 95,221 Unicode characters. Does this mean that a Java application can't handle certain Unicode characters?

Does this boil down to what character encoding you are using?

Marcus Leon asked Mar 28 '10 13:03



2 Answers

You can handle them all if you're careful enough.

Java's char is a UTF-16 code unit. A character whose code point is above 0xFFFF is encoded as two chars (a surrogate pair).

See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
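As a rough illustration of the surrogate-pair point, here is a minimal sketch using only standard `String` and `Character` methods. The class name `SurrogateDemo`, its helper methods, and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample character are assumptions made for this example, not something from the answer above.

```java
public class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP (illustrative choice).
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(CLEF));                            // 2
        System.out.println(codePoints(CLEF));                           // 1
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));   // true
        System.out.printf("U+%X%n", CLEF.codePointAt(0));               // U+1D11E
    }
}
```

So `length()` and `charAt()` count and index UTF-16 code units, while `codePointCount()` and `codePointAt()` work in terms of whole characters.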

(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)

kennytm answered Sep 22 '22 22:09


Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().

And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
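One way such breakage can show up is in code that treats every char as a complete character. The sketch below is a hypothetical example (the class `BmpAssumptionDemo` and both method names are invented for illustration): a char-by-char string reversal that splits a surrogate pair, next to a version that steps through the string by code point.

```java
public class BmpAssumptionDemo {
    // Naive reversal: moves individual chars, so the two halves of a
    // surrogate pair end up in the wrong order.
    static String naiveReverse(String s) {
        char[] out = new char[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[s.length() - 1 - i] = s.charAt(i);
        }
        return new String(out);
    }

    // Code-point-aware reversal: walks backwards one code point at a time.
    static String codePointReverse(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = s.length();
        while (i > 0) {
            int cp = s.codePointBefore(i);
            sb.appendCodePoint(cp);
            i -= Character.charCount(cp); // 1 for BMP characters, 2 otherwise
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a", U+1D11E, "b": three characters, four chars.
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";

        String bad = naiveReverse(s);
        String good = codePointReverse(s);

        // The naive result holds two unpaired surrogates, each counted separately.
        System.out.println(bad.codePointCount(0, bad.length()));   // 4
        System.out.println(good.codePointCount(0, good.length())); // 3
        System.out.println(good.codePointAt(1) == 0x1D11E);        // true
    }
}
```

For this particular task the JDK's `StringBuilder.reverse()` already treats surrogate pairs as single characters; the hand-written loop is only there to make the code-point stepping visible.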

Michael Borgwardt answered Sep 21 '22 22:09