
Will UTF-8 strings always be shorter than UTF-16?

Suppose I have two strings containing the same text, one encoded in UTF-8 and the other in UTF-16.
Is it safe to assume the UTF-8 string will always be smaller than, or the same size as, the UTF-16 one (byte-wise)?

Josh asked Jan 04 '13 14:01


2 Answers

No. While the UTF-8 text will usually be shorter, that's not always the case.

Any character between U+0000 and U+FFFF is represented with 2 bytes (one UTF-16 code unit) in UTF-16.

Characters between U+0800 and U+FFFF will be represented with 3 bytes in UTF-8.

Therefore a text that contains only (or mostly) characters in that range can easily be longer when represented in UTF-8 than in UTF-16.

Put differently:

  • U+0000 - U+007F: UTF-8 is shorter (1 < 2)
  • U+0080 - U+07FF: Both are the same size (2 = 2)
  • U+0800 - U+FFFF: UTF-8 is longer (3 > 2)
  • U+10000 - U+10FFFF: Both are the same size (4 = 4)
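
The four ranges above can be checked with a short Python sketch. The sample characters and the helper name `encoded_sizes` are my own choices for illustration, not part of the original answer. The `utf-16-le` codec is used because Python's plain `utf-16` codec prepends a 2-byte byte order mark, which would skew the comparison.

```python
# One sample character per range; these picks are arbitrary illustrations.
samples = {
    "U+0000 - U+007F": "A",               # U+0041
    "U+0080 - U+07FF": "\u00e9",          # U+00E9
    "U+0800 - U+FFFF": "\u20ac",          # U+20AC
    "U+10000 - U+10FFFF": "\U0001F600",   # U+1F600
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, not counting any BOM."""
    # "utf-16-le" emits no byte order mark, unlike plain "utf-16".
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

The same helper works on whole strings, so it can be used to measure real text rather than single characters.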

Note that 5- and 6-byte sequences were defined in the original UTF-8 design but are no longer valid under the current standard (RFC 3629 restricts UTF-8 to U+10FFFF), and they were never necessary to represent Unicode code points.

Joachim Sauer answered Sep 18 '22 23:09


No. UTF-8 uses 3 or 4 bytes for some characters, depending on how many bits it takes to represent the code point (number) of the character, whereas UTF-16 needs only 2 bytes for any character up to U+FFFF.
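
As a rough illustration of how the bit count of a code point determines the UTF-8 length, here is a minimal Python sketch. The function name `utf8_length` is hypothetical, and the thresholds (7, 11, 16 and 21 bits) are the payload capacities of 1-, 2-, 3- and 4-byte UTF-8 sequences. It does not treat the surrogate range U+D800 to U+DFFF specially, even though those values are not valid on their own in UTF-8.

```python
def utf8_length(code_point):
    """Bytes UTF-8 needs for one code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    bits = code_point.bit_length()
    if bits <= 7:    # U+0000 - U+007F
        return 1
    if bits <= 11:   # U+0080 - U+07FF
        return 2
    if bits <= 16:   # U+0800 - U+FFFF (surrogates not excluded here)
        return 3
    return 4         # U+10000 - U+10FFFF, at most 21 bits


# Cross-check against Python's own encoder for one sample code point.
print(utf8_length(0x20AC), len(chr(0x20AC).encode("utf-8")))
```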

David Grayson answered Sep 19 '22 23:09