
What should I use? UTF8 or UTF16?

I have to distribute my app internationally.

Let's say I have a control (like a memo) where the user enters some text. The user can be Japanese, Russian, Canadian, etc. I want to save the string to disk as a TXT file for later use. I will use MY OWN function to write the text, not something like TMemo.SaveToFile().

How should I save the string to disk? In UTF8 or UTF16 format?

Server Overflow asked Mar 22 '12 08:03



2 Answers

The main difference between them is that UTF8 is backwards compatible with ASCII. As long as you only use the first 128 characters, an application that is not Unicode aware can still process the data (which may be an advantage or a disadvantage, depending on your scenario). In particular, when switching to UTF16 every API function needs to be adjusted for 16-bit strings, while with UTF8 you can often leave old API functions untouched if they don't do any string processing. Also, UTF8 does not depend on endianness, while UTF16 does, which may complicate string I/O.
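A minimal sketch of the ASCII-compatibility and endianness points, written in Python purely for illustration (the question itself concerns Delphi). The sample string is an arbitrary assumption, not something from the question.

```python
import codecs

text = "Hello"  # ASCII-only sample, chosen for illustration

# For ASCII-only text, the UTF-8 bytes are identical to the ASCII bytes.
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
same_as_ascii = utf8_bytes == ascii_bytes

# UTF-16 exists in two byte orders, so the same text has two serializations.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# A byte order mark (BOM) at the start of the data tells a reader which order was used.
file_bytes_le = codecs.BOM_UTF16_LE + utf16_le

print(same_as_ascii)
print(utf16_le[:4])
print(utf16_be[:4])
```

If you write UTF-16 with your own function, you have to pick a byte order and either record it with a BOM or document it, whereas UTF-8 has only one byte sequence for a given text.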

A common misconception is that UTF16 is easier to process because each character always occupies exactly two bytes. That is, unfortunately, not true. UTF16 is a variable-length encoding where a character may take up either 2 or 4 bytes. So any difficulties associated with UTF8 regarding variable-length issues apply to UTF16 just as well.
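To make the variable-length point concrete, here is a small Python sketch that lists the 16-bit code units UTF16 uses for one character inside the BMP and one outside it. The helper name `utf16_code_units` and the two sample characters are illustrative choices.

```python
import struct

def utf16_code_units(s):
    """Return the 16-bit code units of s (little-endian UTF-16) as a list of ints."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp_char = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])
print([hex(u) for u in utf16_code_units(non_bmp_char)])
```

The second character comes out as two code units (a surrogate pair), so code that assumes "one character equals two bytes" will split it incorrectly.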

Finally, storage sizes: Another common myth about UTF16 is that it is more storage-efficient than UTF8 for most foreign languages. UTF8 takes less storage for all European languages, which can be encoded with one or two bytes per character. Non-BMP characters take up 4 bytes in both UTF8 and UTF16. The only case in which UTF16 takes less storage is if your text mainly consists of characters from the range U+0800 through U+FFFF, where the characters for Chinese, Japanese and Hindi are stored: these take 3 bytes in UTF8 but only 2 in UTF16.
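A quick way to check these sizes yourself, sketched in Python. The sample strings and the helper name `encoded_sizes` are assumptions made for illustration; sizes are counted without a BOM.

```python
def encoded_sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count) for s, without a BOM."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

samples = {
    "English": "Hello, world",
    "Russian": "Привет",
    "Japanese": "こんにちは",
    "Emoji": "\U0001F600",
}

for name, s in samples.items():
    u8, u16 = encoded_sizes(s)
    print(f"{name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Note that a Cyrillic word with no spaces or punctuation ties between the two encodings; in running text the spaces, digits and punctuation are single bytes in UTF8, which is where its advantage comes from.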

James McNellis gave an excellent talk at BoostCon 2014, discussing the various trade-offs between different encodings in great detail. Even though the talk is titled Unicode in C++, the entire first half is actually language agnostic. A video recording of the full talk is available at BoostCon's YouTube channel, while the slides can be found on GitHub.

ComicSansMS answered Nov 09 '22 18:11



Depends on the language of your data.

If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8, as for those languages it will take about half the storage of UTF-16. You will pay a penalty when reading the data, as it needs to be converted to UTF-16, which is the Windows default and is used by Delphi's (Unicode) string.

If your data is mostly in non-western languages, UTF-8 can take more storage than UTF-16, as it may take up to 4 bytes per character for some. (see comment by @KennyTM)

Basically: do some tests with representative samples of your users' data and see which performs better, both in storage requirements and load times. We have had some surprises with UTF-16 being slower than we thought. The performance gain of not having to transform from UTF-8 to UTF-16 was lost because of disk access as the data volume in UTF-16 is greater.
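Such a test could look roughly like the following Python sketch, which writes a sample to a temporary file in each encoding and reports file size and read time. The `measure` helper and the placeholder sample are hypothetical; substitute real user data, and treat the timings as rough, since operating system caching affects them.

```python
import os
import tempfile
import time

def measure(text, encoding):
    """Write text to a temp file; return (file size in bytes, read time in seconds)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="" disables newline translation so sizes are comparable across platforms
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        size = os.path.getsize(path)
        start = time.perf_counter()
        with open(path, "r", encoding=encoding, newline="") as f:
            loaded = f.read()
        elapsed = time.perf_counter() - start
        assert loaded == text  # the round trip must be lossless
        return size, elapsed
    finally:
        os.remove(path)

sample = "Hello, world\n" * 1000  # placeholder; use representative user data instead

for enc in ("utf-8", "utf-16"):
    size, elapsed = measure(sample, enc)
    print(f"{enc}: {size} bytes, read in {elapsed:.6f} s")
```

Python's "utf-16" codec writes a BOM at the start of the file, so its file size includes two extra bytes.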

Marjan Venema answered Nov 09 '22 18:11
