Will UTF-8 strings always be shorter than UTF-16?

Question

Suppose I have two strings containing the same text, one encoded as UTF-8 and the other as UTF-16.
Is it safe to assume the UTF-8 string will always be smaller than, or the same size as, the UTF-16 one (byte-wise)?

Joachim Sauer · Accepted Answer

No. While UTF-8 text will usually be shorter, that is not always the case.

Any code point between U+0000 and U+FFFF is represented with 2 bytes (one UTF-16 code unit) in UTF-16.

Characters between U+0800 and U+FFFF will be represented with 3 bytes in UTF-8.

Therefore a text that contains only (or mostly) characters in that range can easily be longer when represented in UTF-8 than in UTF-16.

Put differently:

U+0000 - U+007F: UTF-8 is shorter (1 < 2)
U+0080 - U+07FF: Both are the same size (2 = 2)
U+0800 - U+FFFF: UTF-8 is longer (3 > 2)
U+10000 - U+10FFFF: Both are the same size (4 = 4)
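
The four ranges above can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample characters are only illustrative choices, and `utf-16-le` is used so that no byte order mark is counted in the UTF-16 size.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# One sample character from each of the four ranges.
samples = {
    "U+0041 (A)": "A",
    "U+00E9 (e with acute)": "\u00e9",
    "U+20AC (euro sign)": "\u20ac",
    "U+1F600 (emoji)": "\U0001F600",
}

for label, ch in samples.items():
    utf8_size, utf16_size = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

A text made up entirely of characters from the third range, such as `"日本語"`, comes out at 9 bytes in UTF-8 and 6 bytes in UTF-16 with this helper.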

Note that 5- and 6-byte sequences used to be defined in UTF-8 but are no longer valid under the current standard (RFC 3629), and they were never necessary to represent Unicode code points.

David Grayson · Answer

No. UTF-8 sometimes uses 3 or 4 bytes for a single character, depending on how many bits it takes to represent the character's code point (its number).
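
As a rough illustration of the "how many bits" rule, the following Python sketch derives the UTF-8 length from the bit length of the code point. The function name `utf8_length` is hypothetical, and the sketch assumes a valid code point in the range U+0000 to U+10FFFF (surrogates are not treated specially here).

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one code point (0..0x10FFFF)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    bits = code_point.bit_length()
    if bits <= 7:    # U+0000 - U+007F
        return 1
    if bits <= 11:   # U+0080 - U+07FF
        return 2
    if bits <= 16:   # U+0800 - U+FFFF
        return 3
    return 4         # U+10000 - U+10FFFF


# Compare against Python's own encoder at the range boundaries.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} bytes")
```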
