
Why does this unicode character end up as 6 bytes with UTF-16 encoding?

Tags:

java

unicode

I was playing with a code snippet from the accepted answer to this question. I simply added a byte array to use UTF-16 as follows:

import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701);               // 2 chars (a surrogate pair)
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);     // 4 bytes
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);  // 6 bytes

chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).

asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.

asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?
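
For reference, a self-contained version of the experiment (the class name Utf16SizeCheck is my own, not from the original snippet) prints exactly the counts described above:

import java.nio.charset.StandardCharsets;

public class Utf16SizeCheck {
    public static void main(String[] args) {
        final char[] chars = Character.toChars(0x1F701);
        final String s = new String(chars);
        System.out.println(chars.length);                                // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);  // 6
    }
}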

asked Jan 04 '19 by mahonya


2 Answers

The UTF-16 bytes start with the byte order mark (BOM) FE FF, which indicates that the value is encoded in big-endian order. As the Wikipedia article notes, the BOM is also used to distinguish UTF-16 from UTF-8:

Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.

You can convert the byte[] to a hex-encoded String as per this answer to see this:

asBytes   = F09F9C81
asBytes16 = FEFFD83DDF01
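
A minimal sketch of that conversion (the toHex helper name is my own, not from the linked answer):

static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        sb.append(String.format("%02X", b));  // two uppercase hex digits per byte
    }
    return sb.toString();
}

System.out.println(toHex(asBytes));    // F09F9C81
System.out.println(toHex(asBytes16));  // FEFFD83DDF01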
answered Nov 15 '22 by Karol Dowbecki


asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.

Actually no, the number of chars needed to represent a codepoint in Java has nothing to do with it. The number of bytes is directly related to the numeric value of the codepoint itself.

Codepoint U+1F701 (0x1F701) uses 17 bits (11111011100000001)

0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
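
As a rough illustration (my own bit arithmetic, not the JDK's actual encoder), the four UTF-8 bytes can be derived from the code point like this:

int cp = 0x1F701;                         // 17 significant bits
byte[] utf8 = {
    (byte) (0xF0 | (cp >> 18)),           // 11110xxx -> F0
    (byte) (0x80 | ((cp >> 12) & 0x3F)),  // 10xxxxxx -> 9F
    (byte) (0x80 | ((cp >> 6) & 0x3F)),   // 10xxxxxx -> 9C
    (byte) (0x80 | (cp & 0x3F))           // 10xxxxxx -> 81
};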

asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?

Per the Java documentation for StandardCharsets

UTF_16

public static final Charset UTF_16

Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
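
Again as a rough sketch (my own arithmetic following the surrogate-pair algorithm), the two UTF-16 code units can be derived like this:

int cp = 0x1F701;
int v = cp - 0x10000;                       // 20 bits remaining: 0x0F701
char high = (char) (0xD800 | (v >> 10));    // high surrogate -> 0xD83D
char low  = (char) (0xDC00 | (v & 0x3FF));  // low surrogate  -> 0xDF01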

UTF-16 is byte-order dependent, unlike UTF-8, so StandardCharsets.UTF_16 includes a BOM to specify the byte order actually used in the byte array.

To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:

UTF_16BE

public static final Charset UTF_16BE

Sixteen-bit UCS Transformation Format, big-endian byte order

UTF_16LE

public static final Charset UTF_16LE

Sixteen-bit UCS Transformation Format, little-endian byte order

Since their byte order is implied by their names, they don't need to include a BOM in the byte array.
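
A quick check (my own verification snippet) shows the difference:

String s = new String(Character.toChars(0x1F701));
System.out.println(s.getBytes(StandardCharsets.UTF_16).length);    // 6: BOM + surrogate pair
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 4: surrogate pair only
System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length);  // 4: surrogate pair only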

answered Nov 15 '22 by Remy Lebeau