
How to write 3 bytes unicode literal in Java?

I'd like to write the Unicode character U+10428 as a literal in Java: http://www.marathon-studios.com/unicode/U10428/Deseret_Small_Letter_Long_I

I tried with '\u10428' and it doesn't compile.

kawty asked Jul 08 '14 13:07




1 Answer

Java went all-in on Unicode back when people thought 64K code points would be enough for everyone (where have we heard that before?), so it started out with UCS-2 and was later upgraded to UTF-16.

But they never bothered to add an escape sequence for Unicode characters outside the BMP (the Basic Multilingual Plane, U+0000 to U+FFFF).

Thus, your only recourse with escapes is to manually recode the code point to a UTF-16 surrogate pair and use two \u escapes.

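Your example code point U+10428 is "\uD801\uDC28". Note that this is a String literal: the pair is two char values, so it cannot fit in a single-quoted char literal.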

I used this site for the recoding: https://rishida.net/tools/conversion/
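If you'd rather not rely on a website, the recoding can be sketched in Java itself. This is only a minimal illustration: the class name SurrogatePairDemo and the helper toEscapes are made up for this example, while Character.isSupplementaryCodePoint and Character.toChars are standard library methods.

```java
class SurrogatePairDemo {
    // Hypothetical helper: splits a supplementary code point (above U+FFFF)
    // into its UTF-16 surrogate pair and formats both code units as escapes.
    public static String toEscapes(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return String.format("\\u%04X\\u%04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        // Prints the two escapes for U+10428 (code units D801 and DC28).
        System.out.println(toEscapes(0x10428));

        // The standard library performs the same split at runtime.
        char[] units = Character.toChars(0x10428);
        System.out.println(units.length); // 2
    }
}
```

If you don't need a compile-time literal, new String(Character.toChars(0x10428)) builds the same string at runtime without any manual recoding.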

Quote from the Java Language Specification:

3.10.5 String Literals

A string literal consists of zero or more characters enclosed in double quotes. Characters may be represented by escape sequences (§3.10.6) - one escape sequence for characters in the range U+0000 to U+FFFF, two escape sequences for the UTF-16 surrogate code units of characters in the range U+010000 to U+10FFFF.
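A quick way to convince yourself that the two escapes really form one character is to inspect the literal at runtime. This is a small sketch; the class name SupplementaryLiteralDemo and the constant LONG_I are invented for the example.

```java
class SupplementaryLiteralDemo {
    // U+10428 written as its two UTF-16 surrogate code units.
    static final String LONG_I = "\uD801\uDC28";

    public static void main(String[] args) {
        // Two char values (UTF-16 code units)...
        System.out.println(LONG_I.length());                             // 2
        // ...but only one Unicode code point.
        System.out.println(LONG_I.codePointCount(0, LONG_I.length()));   // 1
        // Reading it back as a code point gives the original value.
        System.out.println(Integer.toHexString(LONG_I.codePointAt(0)));  // 10428
    }
}
```

So length() counts code units, not characters; use the codePoint methods when you need to work with characters outside the BMP.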

Deduplicator answered Oct 20 '22 19:10