I am writing unit tests for my custom StringDatatype, and I need to write a 4-byte Unicode character in a string literal. "\U" does not work (illegal escape character error). For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?
Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8. These code points are the same as those in ASCII CCSID 367. Any other character is encoded with more than 1 byte in UTF-8.
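To see the 1-to-4-byte rule in practice, here is a small sketch that encodes one code point from each length class and counts the bytes. The class and method names (Utf8Lengths, utf8Length) are made up for this illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    // Returns how many bytes UTF-8 needs for a single code point.
    // (Hypothetical helper, not part of any library.)
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample per length class: ASCII, Latin-1 supplement,
        // a BMP symbol, and the supplementary code point from the question.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F701};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Only the first sample falls in the ASCII range, so it is the only one that fits in a single byte.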
Unicode uses two encoding forms: 8-bit and 16-bit, based on the data type of the data that is being encoded. The default encoding form is 16-bit, where each character is 16 bits (2 bytes) wide.
Every char is made up of 2 bytes because Java internally uses UTF-16. For instance, if a String contains an English word, the leading 8 bits of every char will all be 0, as an ASCII character can be represented in a single byte.
A Java char always takes 16 bits. A Unicode character, when encoded as UTF-16, takes "almost always" (not always) 16 bits: that's because there are more than 64K Unicode code points, so those above U+FFFF need two 16-bit code units.
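A quick way to check both claims (the high byte of an ASCII char is zero, and some code points need two chars) is the sketch below. CharUnits and its methods are names invented for this example; Character.charCount is the standard JDK call doing the real work.

```java
final class CharUnits {
    // Number of 16-bit chars needed to hold the code point in UTF-16.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // The leading (high) 8 bits of a single char.
    static int highByte(char c) {
        return c >>> 8;
    }

    public static void main(String[] args) {
        System.out.println(highByte('A'));        // high byte of an ASCII char
        System.out.println(utf16Units('A'));      // a BMP code point
        System.out.println(utf16Units(0x1F701));  // a supplementary code point
    }
}
```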
A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoded form (edit: I was right).
You need to do this:
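// Character.toChars returns the UTF-16 code units (here, a surrogate pair)
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
// Needs: import java.nio.charset.StandardCharsets;
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8); // f0 9f 9c 81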
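For a unit test it may also help to go the other way: start from the four bytes given in the question and check that they decode to U+1F701. This is only a sketch; Utf8RoundTrip and its helper names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Decodes UTF-8 bytes and returns the first code point of the result.
    static int decodeFirstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    // Formats bytes as lowercase hex separated by spaces, e.g. "f0 9f".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fromQuestion = {(byte) 0xF0, (byte) 0x9F, (byte) 0x9C, (byte) 0x81};
        System.out.printf("U+%X%n", decodeFirstCodePoint(fromQuestion));

        byte[] encoded = new String(Character.toChars(0x1F701))
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(encoded));
    }
}
```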
When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... Code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and low surrogate respectively). There is no character literal in Java that lets you enter a code point outside the BMP directly.
Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01" -- or directly as the symbol if your computing environment has support for it.
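If you don't want to work out the surrogate pair by hand, you can let the JDK produce the escape sequence for any code point. The sketch below does that; SurrogatePairs and toEscapes are names made up for the illustration, while Character.toChars and StringBuilder.appendCodePoint are standard.

```java
final class SurrogatePairs {
    // Builds the Java string-literal escapes for a code point,
    // one escape per UTF-16 code unit. (Hypothetical helper.)
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the escape sequence to paste into a string literal.
        System.out.println(toEscapes(0x1F701));

        // The literal and the programmatic forms are the same string.
        String literal = "\uD83D\uDF01";
        String built = new StringBuilder().appendCodePoint(0x1F701).toString();
        System.out.println(literal.equals(built));
    }
}
```

In a test, the appendCodePoint form is often easier to read than the escapes, since the code point is visible in the source.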
See also the CharsetDecoder and CharsetEncoder classes.
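One reason to reach for CharsetEncoder in tests: String.getBytes() silently replaces an unpaired surrogate with '?', whereas the encoder's encode(CharBuffer) method reports it with an exception. A minimal sketch, with StrictUtf8 and its method names invented for the example:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Encodes to UTF-8, throwing on malformed input (e.g. a lone surrogate)
    // instead of substituting a replacement byte.
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // True if the string can be encoded without any substitution.
    static boolean isWellFormed(String s) {
        try {
            encode(s);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String pair = new String(Character.toChars(0x1F701));
        String loneHigh = pair.substring(0, 1); // high surrogate only

        System.out.println(isWellFormed(pair));
        System.out.println(isWellFormed(loneHigh));
        // getBytes() does not throw; it substitutes a '?' byte instead.
        System.out.println(loneHigh.getBytes(StandardCharsets.UTF_8).length);
    }
}
```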
See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
String s = "𩸽";
Technically this is one character. But be careful: s.length() will return 2. Also, Java won't compile char c = '𩸽', because the character does not fit in a single char. Java doesn't promise that String.length() returns the exact number of characters; it returns just the number of Java chars required to store the string. The number of code points can be obtained from s.codePointCount(0, s.length()).
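The difference between the two counts is easy to check. The sketch below builds the same character from its code point (U+29E3D) rather than pasting it, so the example does not depend on the source file encoding; CodePointCounting is a name made up for the illustration.

```java
final class CodePointCounting {
    // Number of code points in the string, as opposed to s.length(),
    // which counts UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+29E3D is the CJK character used in the text above.
        String s = new String(Character.toChars(0x29E3D));
        String mixed = "a" + s + "b";

        System.out.println(s.length());                // chars
        System.out.println(codePoints(s));             // code points
        System.out.println(mixed.length());            // chars
        System.out.println(mixed.codePoints().count()); // code points (Java 8+)
    }
}
```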