
Java charAt used with characters that have two code units

From Core Java, vol. 1, 9th ed., p. 69:

The character ℤ requires two code units in the UTF-16 encoding. Calling

String sentence = "ℤ is the set of integers"; // for clarity; not in book
char ch = sentence.charAt(1)

doesn't return a space but the second code unit of ℤ.

But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true.

String sentence = "ℤ is the set of integers";
if (sentence.charAt(1) == ' ')
    System.out.println("sentence.charAt(1) returns a space");

Why?

I'm using JDK SE 1.7.0_09 on Ubuntu 12.10, if it's relevant.

Patrick Brinich-Langlois asked Jan 04 '13 03:01




1 Answer

It sounds like the book is saying that 'ℤ' is not in the basic multilingual plane, but in fact it is.

Java strings use UTF-16, which represents characters outside the basic multilingual plane with surrogate pairs. Since 'ℤ' (U+2124) is in the basic multilingual plane, it is represented by a single code unit. In your example, sentence.charAt(0) returns 'ℤ' and sentence.charAt(1) returns ' '.

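A character outside the basic multilingual plane is represented by a surrogate pair: two code units that together make up the character. If the sentence began with such a character, sentence.charAt(0) would return the first code unit (the high surrogate) and sentence.charAt(1) would return the second code unit (the low surrogate).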
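
To see the difference side by side, here is a minimal sketch that builds the sentence twice: once starting with U+2124 (the 'ℤ' from the question) and once starting with U+1D56B, a supplementary character chosen here only for illustration. The class name CodeUnitDemo and the helper sentenceStartingWith are made up for this example.

```java
class CodeUnitDemo {
    // Hypothetical helper: builds the example sentence with the given
    // code point as its first character.
    static String sentenceStartingWith(int codePoint) {
        return new String(Character.toChars(codePoint)) + " is the set of integers";
    }

    public static void main(String[] args) {
        String bmp = sentenceStartingWith(0x2124);   // inside the basic multilingual plane
        String supp = sentenceStartingWith(0x1D56B); // outside it (illustrative choice)

        // One code unit for the first character, so index 1 is the space.
        System.out.println(bmp.charAt(1) == ' ');                      // true

        // Two code units for the first character, so the space moves to index 2.
        System.out.println(Character.isHighSurrogate(supp.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supp.charAt(1)));  // true
        System.out.println(supp.charAt(2) == ' ');                     // true
    }
}
```

The characters are built from their code points with Character.toChars so that the result does not depend on the encoding of the source file.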

See http://docs.oracle.com/javase/6/docs/api/java/lang/String.html:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
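
Because index values count code units, reading the "second character" of a string that may contain supplementary characters takes an extra step. The following is a rough sketch using String.offsetByCodePoints and String.codePointAt; the class name CodePointDemo, the helper codePointAtIndex, and the choice of U+1D56B as the supplementary character are assumptions made for this example.

```java
class CodePointDemo {
    // Hypothetical helper: returns the code point at the given code point
    // index, counting each surrogate pair as a single position.
    static int codePointAtIndex(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // U+1D56B lies outside the basic multilingual plane (illustrative choice).
        String supp = new String(Character.toChars(0x1D56B)) + " is the set of integers";

        System.out.println(supp.length());                         // 25 code units
        System.out.println(supp.codePointCount(0, supp.length())); // 24 code points
        System.out.println(codePointAtIndex(supp, 1) == ' ');      // true
    }
}
```

Here length() counts the surrogate pair as two positions, while codePointCount counts it as one, which is why the two results differ by one.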

MarcFasel answered Oct 12 '22 21:10
