
Is there any reason to prefer UTF-16 over UTF-8?

Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.

However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information.

Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do the same?

EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.

Oak asked May 29 '10 11:05



1 Answer

East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).

Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.
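The size difference is easy to check for yourself. Here is a minimal Python sketch, assuming you only care about raw encoded byte counts without a byte order mark; the helper name `encoded_sizes` and the sample strings are just made up for illustration:

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length of text in bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is added to the count
        "utf-16": len(text.encode("utf-16-le")),
    }


# Illustrative samples: plain ASCII text versus Japanese text
print(encoded_sizes("Hello, world"))    # {'utf-8': 12, 'utf-16': 24}
print(encoded_sizes("こんにちは世界"))  # {'utf-8': 21, 'utf-16': 14}
```

Running it over a sample of your own data is probably the most reliable way to see which encoding is smaller for your case.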

Processing UTF-16 in user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way as combining characters. So UTF-16 can usually be processed as if it were a fixed-size encoding.
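To illustrate the comparison between surrogate pairs and combining characters, here is a rough Python sketch. The helper name `utf16_code_units` is hypothetical, and it counts 16-bit units by encoding to UTF-16 rather than by inspecting any particular runtime's string type:

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "世"           # U+4E16, inside the Basic Multilingual Plane
astral_char = "😀"        # U+1F600, outside the BMP, needs a surrogate pair
combining = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

print(utf16_code_units(bmp_char))     # 1
print(utf16_code_units(astral_char))  # 2
print(utf16_code_units(combining))    # 2
```

In both of the last two cases, one character as the user sees it takes two code units, so code that must not split a base letter from its combining mark already has to handle multi-unit sequences, and surrogate pairs fall under the same care.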

Dean Harding answered Sep 23 '22 03:09