 

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).

asked Oct 05 '08 by dicroce

People also ask

What is the difference between UTF-8 and UTF-16?

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character occupies a minimum of 8 bits, while in UTF-16 character length starts at 16 bits. Main UTF-8 advantage: basic ASCII characters (digits, Latin letters with no accents, etc.) take only one byte each.

Should I use UTF-8 or UTF-16?

If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8 as for those languages it will take about half the storage of UTF-16.

What is difference between UTF-8 and ASCII?

UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.

What is UTF-16 used for?

UTF-16 (16-bit Unicode Transformation Format) is a standard method of encoding Unicode character data. Part of the Unicode Standard version 3.0 (and higher-numbered versions), UTF-16 has the capacity to encode all currently defined Unicode characters.
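
To make the storage difference concrete, here is a small C++ sketch (my own illustration, not from either answer) that writes the same mostly-ASCII word out by hand in both encodings and prints the byte counts:

    #include <cstring>
    #include <iostream>

    int main() {
        // "héllo": four ASCII letters plus 'é' (U+00E9).
        const char     utf8[]  = "h\xC3\xA9llo";   // UTF-8: 1 byte per ASCII letter, 2 for 'é'
        const char16_t utf16[] = u"h\u00E9llo";    // UTF-16: 2 bytes per code unit

        std::cout << "UTF-8 bytes:  " << std::strlen(utf8) << '\n';                  // 6
        std::cout << "UTF-16 bytes: " << sizeof(utf16) - sizeof(char16_t) << '\n';   // 10 (excluding the terminator)
    }

For Latin-script text like this, UTF-8 comes out smaller; for text dominated by CJK characters the comparison tips the other way, since those need 3 bytes in UTF-8 but only 2 in UTF-16.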


2 Answers

Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8

answered by Dylan Beattie


A character encoding maps a sequence of numeric codes to symbols from a given character set. See this good article on Wikipedia on character encoding.

UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte encoding works:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
  • A UTF-8 stream contains neither the byte FE nor FF, which makes sure that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (the byte-order mark).
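
As an illustration of those lead-byte rules, here is a minimal C++ sketch (the utf8_sequence_length helper is invented for this example, not a standard function) that classifies each lead byte of a hand-encoded UTF-8 string:

    #include <cstdint>
    #include <cstdio>

    // Determine how many bytes a UTF-8 sequence occupies by inspecting the
    // most significant bits of its first byte, per the rules above.
    // Returns 0 for a continuation byte (10xxxxxx) or an invalid byte (FE/FF).
    int utf8_sequence_length(std::uint8_t lead) {
        if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: single-byte (ASCII)
        if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;
    }

    int main() {
        // "a", "é" (U+00E9), "€" (U+20AC), "😀" (U+1F600): 1-, 2-, 3- and 4-byte sequences.
        const char text[] = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
        for (int i = 0; text[i] != '\0'; ) {
            std::uint8_t lead = static_cast<std::uint8_t>(text[i]);
            int len = utf8_sequence_length(lead);
            std::printf("lead byte 0x%02X -> %d-byte sequence\n", static_cast<unsigned>(lead), len);
            i += (len > 0 ? len : 1);  // skip malformed bytes one at a time
        }
    }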

The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.

UTF-16 (UCS-2)

Uses 2 or 4 bytes for each symbol. (Strictly speaking, UCS-2 is the older fixed-width 2-byte encoding; UTF-16 extends it with surrogate pairs to reach code points above U+FFFF.)
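
For code points above U+FFFF, that 4-byte case is a surrogate pair. A rough C++ sketch of the encoding step (the encode_utf16 helper is illustrative only, not a library routine):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Encode one Unicode code point as UTF-16 code units: code points up to
    // U+FFFF fit in a single 16-bit unit; anything above (up to U+10FFFF)
    // becomes a high/low surrogate pair, i.e. 4 bytes in total.
    std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
        if (cp <= 0xFFFF)
            return { static_cast<std::uint16_t>(cp) };
        cp -= 0x10000;
        return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),     // high surrogate
                 static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) }; // low surrogate
    }

    int main() {
        for (std::uint32_t cp : { 0x41u, 0x20ACu, 0x1F600u }) {  // 'A', '€', '😀'
            std::printf("U+%04X ->", static_cast<unsigned>(cp));
            for (std::uint16_t unit : encode_utf16(cp))
                std::printf(" %04X", static_cast<unsigned>(unit));
            std::printf("\n");
        }
    }

Running it shows 'A' and '€' each fitting in one unit (0041 and 20AC), while the emoji needs the pair D83D DE00, i.e. 4 bytes.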

UTF-32 (UCS-4)

Always uses 4 bytes for each symbol.

char just means a byte of data and is not an actual encoding; it is not analogous to UTF-8, UTF-16 or ASCII. A char* pointer can refer to any type of data and any encoding.

STL:

Neither of the STL's string classes, std::string and std::wstring, is designed for variable-length character encodings like UTF-8 and UTF-16; they store fixed-size code units and leave interpretation of the bytes to you.
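
You can still keep UTF-8 in a std::string, it just doesn't understand it. A small sketch of the usual gotcha: size() reports bytes, and counting code points is up to you:

    #include <iostream>
    #include <string>

    int main() {
        // UTF-8 bytes for "été": 3 code points, 5 bytes.
        std::string s = "\xC3\xA9t\xC3\xA9";

        std::cout << "bytes:       " << s.size() << '\n';  // 5, not 3

        // Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
        std::size_t code_points = 0;
        for (unsigned char b : s)
            if ((b & 0xC0) != 0x80)
                ++code_points;
        std::cout << "code points: " << code_points << '\n';  // 3
    }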

How to implement:

Take a look at the iconv library. iconv is a powerful character-encoding conversion library used by projects such as libxml (the GNOME XML C parser).
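
A minimal sketch of the iconv C API (assuming a POSIX system; on some platforms you also link with -liconv), converting a UTF-8 buffer to UTF-16LE, with error handling kept to a bare minimum:

    #include <iconv.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        // Open a conversion descriptor: arguments are to-encoding, from-encoding.
        iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
        if (cd == (iconv_t)-1) {
            std::perror("iconv_open");
            return 1;
        }

        char in[] = "h\xC3\xA9llo";   // UTF-8 input (6 bytes)
        char out[64] = {0};           // room for the UTF-16LE output

        char*  inptr   = in;
        char*  outptr  = out;
        size_t inleft  = std::strlen(in);
        size_t outleft = sizeof(out);

        // iconv advances the pointers and decrements the "left" counters as it converts.
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1)
            std::perror("iconv");

        std::printf("produced %zu UTF-16 bytes\n", sizeof(out) - outleft);  // 10
        iconv_close(cd);
    }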

Other great resources on character encoding:

  • tbray.org's Characters vs. Bytes
  • IANA character sets
  • www.cs.tut.fi's A tutorial on character code issues
  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (first mentioned by @Dylan Beattie)
answered by Brian R. Bondy