From Core Java, vol. 1, 9th ed., p. 69:
The character ℤ requires two code units in the UTF-16 encoding. Calling
String sentence = "ℤ is the set of integers"; // for clarity; not in book
char ch = sentence.charAt(1)
doesn't return a space but the second code unit of ℤ.
But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true.
String sentence = "ℤ is the set of integers";
if (sentence.charAt(1) == ' ')
    System.out.println("sentence.charAt(1) returns a space");
Why?
I'm using JDK SE 1.7.0_09 on Ubuntu 12.10, if it's relevant.
Java String charAt() Method
The charAt() method returns the char value (a UTF-16 code unit) at the specified index in a string. The index of the first char is 0, the second is 1, and so on.
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
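To make the code point / code unit distinction concrete, here is a minimal sketch using Character.charCount, which reports how many 16-bit char values a given code point needs. The class name CodeUnitsDemo and the choice of U+1D546 as a sample supplementary code point are mine, purely for illustration.

```java
public class CodeUnitsDemo {
    // Number of UTF-16 code units (char values) needed to encode one code point.
    static int codeUnitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int bmp = 0x2124;            // U+2124, inside the Basic Multilingual Plane
        int supplementary = 0x1D546; // U+1D546, outside the BMP (illustrative choice)

        System.out.println(codeUnitsFor(bmp));           // 1
        System.out.println(codeUnitsFor(supplementary)); // 2
        System.out.println(Character.isSupplementaryCodePoint(supplementary)); // true
    }
}
```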
toCharArray() instead of charAt(): it seems that converting the string to a char[] once and indexing the array is much faster than calling charAt() repeatedly.
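For reference, the two approaches look like this. This is only a sketch showing that they visit the same char values; it is not a benchmark, so it says nothing about which one is actually faster on a given JVM. The class and method names are made up for the example.

```java
public class SpaceCount {
    // Counts spaces by calling charAt() for every index.
    static int countWithCharAt(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ' ') {
                n++;
            }
        }
        return n;
    }

    // Counts spaces by copying the string into a char[] once and iterating over it.
    static int countWithCharArray(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c == ' ') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "a b c d";
        System.out.println(countWithCharAt(s));    // 3
        System.out.println(countWithCharArray(s)); // 3
    }
}
```

Note that toCharArray() allocates a copy of the string's contents, so any speed difference has to be weighed against that extra allocation.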
Java actually uses Unicode, which includes ASCII and other characters from languages around the world.
It sounds like the book is saying that 'ℤ' lies outside the Basic Multilingual Plane (BMP), but in fact it is inside it.
Java uses UTF-16 with surrogate pairs for characters that are not in the Basic Multilingual Plane. Since 'ℤ' (U+2124) is in the Basic Multilingual Plane, it is represented by a single code unit. In your example, sentence.charAt(0) will return 'ℤ', and sentence.charAt(1) will return ' '.
A character represented by a surrogate pair is made up of two code units. If the sentence began with such a character, sentence.charAt(0) would return the first code unit (the high surrogate), and sentence.charAt(1) would return the second code unit (the low surrogate).
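You can check both cases yourself with a sketch like the one below. It builds the strings from code points with Character.toChars, so the source file needs no non-ASCII literals. U+1D546 is just a supplementary character I picked for the comparison (it is not the one from the question), and the names SurrogateDemo and sentenceFor are hypothetical.

```java
public class SurrogateDemo {
    // Builds "<character> is a set" from a code point.
    static String sentenceFor(int codePoint) {
        return new String(Character.toChars(codePoint)) + " is a set";
    }

    public static void main(String[] args) {
        String bmp = sentenceFor(0x2124);   // U+2124: one code unit
        String supp = sentenceFor(0x1D546); // U+1D546: two code units (surrogate pair)

        // BMP character: index 1 is already the space.
        System.out.println(bmp.charAt(1) == ' ');                      // true

        // Supplementary character: indexes 0 and 1 hold the two surrogates,
        // and the space only shows up at index 2.
        System.out.println(Character.isHighSurrogate(supp.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supp.charAt(1)));  // true
        System.out.println(supp.charAt(2) == ' ');                     // true
    }
}
```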
See http://docs.oracle.com/javase/6/docs/api/java/lang/String.html:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
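Because index values count code units, length() and the number of actual characters can differ. One way to walk a string by code point, sketched here with codePointAt and Character.charCount (both available since Java 5, so this works on JDK 7 too), is shown below. The class name CodePointWalk and the sample string are assumptions for the example.

```java
public class CodePointWalk {
    // Counts code points by stepping over one or two code units at a time.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // full code point starting at index i
            i += Character.charCount(cp); // advance by 1 (BMP) or 2 (surrogate pair)
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One supplementary character (U+1D546) followed by " is".
        String s = new String(Character.toChars(0x1D546)) + " is";

        System.out.println(s.length());                      // 5 code units
        System.out.println(countCodePoints(s));              // 4 code points
        System.out.println(s.codePointCount(0, s.length())); // 4, the built-in equivalent
    }
}
```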