I have the character '😭'. Its Unicode code point is U+1F62D, whose binary equivalent is 11111011000101101. Now I want to convert this character to a byte array. My steps:
1) As the binary representation is bigger than 2 bytes, I use 4 bytes:
XXXXXXXX XXXXXXX1 11110110 00101101
2) Now I replace every 'X' with '0':
00000000 00000001 11110110 00101101
3) The (signed) decimal equivalents are:
00000000 (0), 00000001 (1), 11110110 (-10), 00101101 (45)
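To double-check step 3, I wrote this small sketch (the class name is mine, and java.util.Arrays is only used for printing); Java bytes are signed, so 11110110 comes out as -10:

import java.util.Arrays;

public class CodePointBytes {
    public static void main(String[] args) {
        int codePoint = 0x1F62D;                    // 😭
        byte[] bytes = {
                (byte) (codePoint >>> 24),          // 00000000 ->   0
                (byte) (codePoint >>> 16),          // 00000001 ->   1
                (byte) (codePoint >>> 8),           // 11110110 -> -10
                (byte) codePoint                    // 00101101 ->  45
        };
        System.out.println(Arrays.toString(bytes)); // prints [0, 1, -10, 45]
    }
}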
This is my code:
import java.nio.charset.StandardCharsets;

import org.junit.Test;

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

@Test
public void testUtf16With4Bytes() throws Exception {
    assertThat(
            new String(
                    new byte[]{0, 1, -10, 45},
                    StandardCharsets.UTF_16BE
            ),
            is("😭")
    );
}
This is the output:
java.lang.AssertionError:
Expected: is "😭"
but: was ""
What did I miss?
You missed that some characters are stored in UTF-16 as surrogate pairs:
In UTF-16, characters in the ranges U+0000—U+D7FF and U+E000—U+FFFD are stored as a single 16-bit unit. Non-BMP characters (range U+10000—U+10FFFF) are stored as “surrogate pairs”, two 16-bit units: a high surrogate (in range U+D800—U+DBFF) followed by a low surrogate (in range U+DC00—U+DFFF). A lone surrogate character is invalid in UTF-16; surrogate characters are always written as pairs (high followed by low).
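Here is a minimal sketch of that arithmetic (the class and variable names are mine; the standard library's Character.highSurrogate and Character.lowSurrogate do the same computation):

public class SurrogatePair {
    public static void main(String[] args) {
        int codePoint = 0x1F62D;                        // 😭, above U+FFFF
        int offset = codePoint - 0x10000;               // 0x0F62D, fits in 20 bits
        char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        System.out.printf("U+%04X U+%04X%n", (int) high, (int) low);
        // prints U+D83D U+DE2D, same as Character.highSurrogate(codePoint)
        // and Character.lowSurrogate(codePoint)
    }
}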
The 😭 character is U+1F62D, so it falls into the U+10000—U+10FFFF range. It's represented with the surrogate pair U+D83D U+DE2D, which as a byte[] is [-40, 61, -34, 45].
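With those bytes your test passes; a sketch assuming your original JUnit 4 + Hamcrest setup:

@Test
public void testUtf16WithSurrogatePair() throws Exception {
    assertThat(
            new String(
                    new byte[]{-40, 61, -34, 45}, // 0xD8 0x3D 0xDE 0x2D = U+D83D U+DE2D
                    StandardCharsets.UTF_16BE
            ),
            is("😭")
    );
}

You can also get that array directly from "😭".getBytes(StandardCharsets.UTF_16BE) instead of computing it by hand.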