
Why does this unicode character end up as 6 bytes with UTF-16 encoding?

Tags:

java

unicode

I was playing with a code snippet from the accepted answer to this question. I simply added a byte array to use UTF-16 as follows:

import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701);               // 2 chars (a surrogate pair)
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);     // 4 bytes
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);  // 6 bytes

chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).

asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.

asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?
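
For reference, a self-contained version of the experiment (the class name Utf16SizeCheck is my own, not from the original snippet) prints exactly the counts described above:

import java.nio.charset.StandardCharsets;

public class Utf16SizeCheck {
    public static void main(String[] args) {
        final char[] chars = Character.toChars(0x1F701);
        final String s = new String(chars);
        System.out.println(chars.length);                                // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);  // 6
    }
}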

asked Jan 04 '19 by mahonya


2 Answers

The UTF-16 bytes start with the byte order mark (BOM) FE FF, which indicates that the value is encoded in big-endian order. As the Wikipedia article notes, the BOM is also used to distinguish UTF-16 from UTF-8:

Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.

You can convert the byte[] to a hex-encoded String as per this answer to see this:

asBytes   = F09F9C81
asBytes16 = FEFFD83DDF01
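
A minimal sketch of that conversion (the toHex helper name is my own, not from the linked answer):

static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        sb.append(String.format("%02X", b));  // two uppercase hex digits per byte
    }
    return sb.toString();
}

System.out.println(toHex(asBytes));    // F09F9C81
System.out.println(toHex(asBytes16));  // FEFFD83DDF01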
answered Nov 15 '22 by Karol Dowbecki


asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.

Actually no, the number of chars needed to represent a codepoint in Java has nothing to do with it. The number of bytes is directly related to the numeric value of the codepoint itself.

Codepoint U+1F701 (0x1F701) uses 17 bits (11111011100000001)

0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
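
As a rough illustration (my own bit arithmetic, not the JDK's actual encoder), the four UTF-8 bytes can be derived from the code point like this:

int cp = 0x1F701;                         // 17 significant bits
byte[] utf8 = {
    (byte) (0xF0 | (cp >> 18)),           // 11110xxx -> F0
    (byte) (0x80 | ((cp >> 12) & 0x3F)),  // 10xxxxxx -> 9F
    (byte) (0x80 | ((cp >> 6) & 0x3F)),   // 10xxxxxx -> 9C
    (byte) (0x80 | (cp & 0x3F))           // 10xxxxxx -> 81
};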

asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?

Per the Java documentation for StandardCharsets

UTF_16

public static final Charset UTF_16

Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
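
Again as a rough sketch (my own arithmetic following the surrogate-pair algorithm), the two UTF-16 code units can be derived like this:

int cp = 0x1F701;
int v = cp - 0x10000;                       // 20 bits remaining: 0x0F701
char high = (char) (0xD800 | (v >> 10));    // high surrogate -> 0xD83D
char low  = (char) (0xDC00 | (v & 0x3FF));  // low surrogate  -> 0xDF01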

UTF-16 is byte-order dependent, unlike UTF-8, so StandardCharsets.UTF_16 includes a BOM to specify the byte order actually used in the byte array.

To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:

UTF_16BE

public static final Charset UTF_16BE

Sixteen-bit UCS Transformation Format, big-endian byte order

UTF_16LE

public static final Charset UTF_16LE

Sixteen-bit UCS Transformation Format, little-endian byte order

Since their byte order is implied by their names, they don't need to include a BOM in the byte array.
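
A quick check (my own verification snippet) shows the difference:

String s = new String(Character.toChars(0x1F701));
System.out.println(s.getBytes(StandardCharsets.UTF_16).length);    // 6: BOM + surrogate pair
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 4: surrogate pair only
System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length);  // 4: surrogate pair only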

answered Nov 15 '22 by Remy Lebeau