
Comparing a char to a code-point?

Tags:

java

unicode

What is the "correct" way of comparing a code-point to a Java character? For example:

int codepoint = str.codePointAt(0); char token = '\n'; // str being some String

I know I can probably do:

if (codepoint==(int) token) { ... } 

but this code looks fragile. Is there a formal API method for comparing codepoints to chars, or converting the char up to a codepoint for comparison?

asked Jun 22 '09 by Gili


People also ask

Do you use == or .equals for char?

A char is a primitive, not an object, so you cannot use equals() on it the way you do for strings; compare primitive chars with ==. equals() only applies to objects such as String or the boxed Character.
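For illustration, a minimal sketch of both cases (the variable names are just examples):

char a = '\n';
char b = '\n';
System.out.println(a == b);             // true: primitive chars are compared with ==

Character boxed = 'x';                  // a boxed Character is an object...
System.out.println(boxed.equals('x'));  // ...so equals() applies to it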

What is a Unicode point?

A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme). Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms).
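As a small illustration of precomposed forms (using the standard java.text.Normalizer class), the grapheme "é" can be written either as the single code point U+00E9 or as U+0065 followed by the combining acute accent U+0301:

import java.text.Normalizer;

String precomposed = "\u00E9";    // é as one code point
String decomposed  = "e\u0301";   // e + combining acute accent (two code points)
System.out.println(precomposed.equals(decomposed));   // false: different code point sequences
System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
        .equals(precomposed));                        // true after NFC normalization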

What is code point in Java?

The String.codePointAt() method returns the Unicode code point of the character at the specified index in a string. The index of the first character is 0, the second character is 1, and so on.
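A quick sketch of codePointAt() (the string literal is just an example; 😀 is U+1F600, which needs a surrogate pair):

String s = "A\uD83D\uDE00";             // "A" followed by U+1F600
System.out.println(s.codePointAt(0));   // 65     (U+0041, 'A')
System.out.println(s.codePointAt(1));   // 128512 (U+1F600, read from the surrogate pair)
System.out.println((int) s.charAt(1));  // 55357  (0xD83D, just the high-surrogate code unit)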


1 Answer

A little bit of background: When Java appeared in 1995, the char type was based on the original "Unicode 88" specification, which was limited to 16 bits. A year later, when Unicode 2.0 was implemented, the concept of surrogate characters was introduced to go beyond the 16-bit limit.

Java internally represents all Strings in UTF-16 format. For code points exceeding U+FFFF, the code point is represented by a surrogate pair, i.e., two chars: the first is the high-surrogate code unit (in the range \uD800-\uDBFF), the second is the low-surrogate code unit (in the range \uDC00-\uDFFF).
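As a minimal sketch of that representation (using U+2F81A, the CJK ideograph the documentation quote below also uses as an example):

int cp = 0x2F81A;                                               // a supplementary code point
char[] units = Character.toChars(cp);                           // its UTF-16 encoding
System.out.println(units.length);                               // 2: a surrogate pair
System.out.printf("%X %X%n", (int) units[0], (int) units[1]);   // D87E DC1A
System.out.println(Character.isHighSurrogate(units[0]));        // true
System.out.println(Character.isLowSurrogate(units[1]));         // true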

From the early days, all basic Character methods were based on the assumption that a code point could be represented in one char, and that is what the method signatures still reflect. Presumably to preserve backward compatibility, this was not changed when Unicode 2.0 came around, so caution is needed when dealing with those methods. To quote from the Java documentation:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
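A small sketch of that difference, reusing the documentation's example values:

System.out.println(Character.isLetter('\uD840'));  // false: a lone high surrogate is undefined
System.out.println(Character.isLetter(0x2F81A));   // true:  the int overload sees the whole code point
System.out.println(Character.isLetter((int) 'A')); // true:  the int overload also covers BMP characters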

Casting the char to an int, as you do in your sample, works fine though.
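So, for the original example, a minimal sketch of the comparison (str and its contents are hypothetical):

String str = "foo\nbar";
int codepoint = str.codePointAt(3);   // 10, the code point of '\n'
char token = '\n';

// The char is promoted to int automatically, so no explicit cast is needed.
// The comparison can only ever be true for code points in the BMP (<= U+FFFF),
// because a single char cannot represent anything higher.
if (codepoint == token) {
    System.out.println("newline found");
}

Note that the explicit (int) cast in the question is harmless but unnecessary; the char is widened to int by the comparison itself.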

answered Oct 09 '22 by Christian Hang-Hicks