If I have two strings of the same text, one encoded in UTF-8 and the other in UTF-16, is it safe to assume the UTF-8 string will always be smaller than, or the same size as, the UTF-16 one (byte-wise)?
No. While the UTF-8 text will usually be shorter, that is not always the case.
Anything between U+0000 and U+FFFF is represented with 2 bytes (one UTF-16 code unit) in UTF-16.
Characters between U+0800 and U+FFFF will be represented with 3 bytes in UTF-8.
Therefore, a text that contains only (or mostly) characters in that range can easily be longer when represented in UTF-8 than in UTF-16.
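As a quick illustration of the point above, here is a minimal Python sketch that compares the encoded sizes of two sample strings. The helper name `encoded_sizes` and the sample strings are my own choices for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    utf8_size = len(text.encode("utf-8"))
    # "utf-16-le" omits the 2-byte BOM that plain "utf-16" would prepend.
    utf16_size = len(text.encode("utf-16-le"))
    return utf8_size, utf16_size


# ASCII only: UTF-8 is half the size of UTF-16.
print(encoded_sizes("hello"))   # (5, 10)

# Three characters in the U+0800 to U+FFFF range: UTF-8 is larger.
print(encoded_sizes("日本語"))  # (9, 6)
```

So for text dominated by characters in that range (much CJK text, for example), UTF-8 comes out about 50% larger than UTF-16.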
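Put differently:
U+0000 to U+007F: 1 byte in UTF-8, 2 bytes in UTF-16
U+0080 to U+07FF: 2 bytes in UTF-8, 2 bytes in UTF-16
U+0800 to U+FFFF: 3 bytes in UTF-8, 2 bytes in UTF-16
U+10000 to U+10FFFF: 4 bytes in UTF-8, 4 bytes in UTF-16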
Note that 5- and 6-byte sequences used to be defined in UTF-8, but they are no longer valid under the current standard and were never necessary to represent Unicode code points.
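The per-code-point sizes can also be sketched as two small functions and checked against Python's own encoders. The function names `utf8_len` and `utf16_len` are hypothetical helpers for illustration, not part of any library, and the sample code points are arbitrary picks from each range.

```python
def utf8_len(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4  # up to U+10FFFF; longer sequences are not valid UTF-8


def utf16_len(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-16."""
    # One 16-bit code unit for U+0000..U+FFFF, a surrogate pair above that.
    return 2 if code_point < 0x10000 else 4


# Cross-check against the built-in encoders for one sample per range.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    ch = chr(cp)
    assert utf8_len(cp) == len(ch.encode("utf-8"))
    assert utf16_len(cp) == len(ch.encode("utf-16-le"))
```

Only the third range (U+0800 to U+FFFF) makes UTF-8 the larger of the two; in the first range UTF-8 is smaller, and in the others the sizes are equal.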
No. UTF-8 sometimes uses 3 or 4 bytes for a single character, depending on how many bits it takes to represent the code point (number) of the character.