
Why does "Ꙭ".codePointAt(0)==205 and other Java Character bizarreness?

Tags: java, unicode

(Lest this get closed as too localized, I chose Ꙭ as an example but this happens for many other characters also)

The character Ꙭ is \uA66C or decimal 42604 (http://unicodinator.com/#A66C). I'm seeing some very weird things I can't understand while using Java's Character class.

1) Character.isLetter('Ꙭ'); // won't compile, complains 'unclosed character literal'
2) Character.isLetter("Ꙭ".charAt(0)); // returns true, which is right
3) Character.isLetter(42604); // returns false
4) Character.isLetter('\uA66C'); // returns false
5) "Ꙭ".codePointAt(0); // returns 205? 205 is Í http://unicodinator.com/#00CD
6) ("Ꙭ".charAt(0) == (char) 42604) // is false

Everything except #2 does not make sense to me. This character is in the BMP and is not from \uD800 to \uDFFF so there shouldn't be any complexity with surrogates. It seems like I'm missing some key concept here...

jwl asked Mar 24 '13 21:03


1 Answer

It looks as if the character encoding your editor uses to save the file differs from the one javac (or an equivalent compiler) uses to read it. Unless told otherwise, javac picks up whichever encoding happens to be the default on your machine. Pass the -encoding option to javac (for example, -encoding UTF-8) to set it explicitly.

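Ꙭ in UTF-8 is the three bytes 0xEA 0x99 0xAC. Read as Latin 1 (or similar), those bytes become three separate characters, displayed roughly as ê¬ because the middle one is an unprintable control character. That isn't valid for a character literal, as it is three characters, hence the 'unclosed character literal' error in 1.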
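To see the mismatch without involving the compiler, here is a minimal sketch that encodes the character as UTF-8 and decodes the same bytes as Latin 1. The class and method names are made up for illustration, and the source is kept ASCII-only so it does not depend on your javac settings. Note that Latin 1 gives 234 (0xEA) for the first character, not the 205 from 5; 205 would be consistent with a Mac OS Roman default, where 0xEA maps to Í, but that is a guess about your setup.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
final class MojibakeDemo {

    // Encode as UTF-8, then decode the same bytes as ISO-8859-1 (Latin 1).
    // This mimics a compiler reading a UTF-8 source file with the wrong encoding.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\uA66C";            // the character from the question
        String garbled = misdecode(original);

        System.out.println(original.length()); // 1
        System.out.println(garbled.length());  // 3

        // Prints the three Latin 1 code points: 0xEA 0x99 0xAC
        for (char c : garbled.toCharArray()) {
            System.out.printf("0x%02X ", (int) c);
        }
        System.out.println();
    }
}
```

One char in, three chars out: that is why the literal in 1 no longer fits between two single quotes.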

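As for 3 and 4, the character was apparently introduced in Unicode 5.1.0 (April 2008), which is relatively new and presumably isn't supported by the version of Java you are using. Java SE 6 is based on Unicode 4.0; Java SE 7 is based on Unicode 6.0.0. A runtime that predates the character treats the code point as unassigned, so Character.isLetter returns false for it.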
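A quick way to check what your own runtime knows about the code point is sketched below. The class and helper names are hypothetical. It assumes only Character.getType(int) and Character.isLetter(int), both available since Java 5. On Java SE 7 or later I would expect both checks to print true, and on Java SE 6 both should print false if the explanation above is right.

```java
// Hypothetical demo class; not part of any library.
final class UnicodeSupportDemo {

    static final int SAMPLE = 0xA66C; // decimal 42604, the character from the question

    // A code point the runtime's Unicode tables do not cover is reported as UNASSIGNED.
    static boolean knownToRuntime(int codePoint) {
        return Character.getType(codePoint) != Character.UNASSIGNED;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("assigned     = " + knownToRuntime(SAMPLE));
        System.out.println("isLetter     = " + Character.isLetter(SAMPLE));
    }
}
```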

Most people stick to US ASCII for source files, with good reason: written as the escape \uA66C, the character compiles the same way whatever encoding javac assumes.
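As a small sketch of the ASCII-only approach (the class name is made up), the version below uses the Unicode escape instead of the literal character, so 5 and 6 from the question behave as expected regardless of the file encoding:

```java
// Hypothetical demo class; the source file contains only ASCII characters.
final class AsciiSourceDemo {

    // Unicode escape for the character from the question.
    static final String SAMPLE = "\uA66C";

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());                  // 1
        System.out.println(SAMPLE.codePointAt(0));            // 42604
        System.out.println(SAMPLE.charAt(0) == (char) 42604); // true
    }
}
```

This only addresses the encoding problem. Whether Character.isLetter recognises the character still depends on the Unicode version your Java release supports.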

Tom Hawtin - tackline answered Nov 12 '22 05:11