I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).
Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 a character's length starts at 16 bits. Main UTF-8 pro: basic ASCII characters such as digits and unaccented Latin letters occupy a single byte that is identical to their US-ASCII representation, so plain US-ASCII text is already valid UTF-8.
If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8, as for those languages it will take about half the storage of UTF-16.
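To make the size difference concrete, here is a minimal C++ sketch (nothing beyond the standard library is assumed) that stores the same ASCII text as UTF-8 bytes and as UTF-16 code units and prints the totals:

```cpp
#include <iostream>
#include <string>

int main() {
    // The same English text held as UTF-8 (1 byte per ASCII character)
    // and as UTF-16 (2 bytes per ASCII character).
    std::string    utf8  =  "Hello, world";
    std::u16string utf16 = u"Hello, world";

    std::cout << utf8.size()  * sizeof(char)     << " bytes as UTF-8\n"    // 12
              << utf16.size() * sizeof(char16_t) << " bytes as UTF-16\n";  // 24
}
```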
UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.
UTF-16 (16-bit Unicode Transformation Format) is a standard method of encoding Unicode character data. Part of the Unicode Standard version 3.0 (and higher-numbered versions), UTF-16 has the capacity to encode all currently defined Unicode characters.
Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8
A character encoding consists of a sequence of codes that each look up a symbol from a given character set. Please see this good article on Wikipedia on character encoding.
UTF-8 uses 1 to 4 bytes for each symbol. Wikipedia gives a good rundown of how the multi-byte encoding works (a short sketch following the list below illustrates the lead-byte rules):
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
- A UTF-8 stream contains neither the byte FE nor FF, which ensures that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (the byte-order mark).
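As an illustration of those lead-byte rules, here is a small self-contained C++ sketch; the helper name utf8_sequence_length is purely illustrative, not from any particular library:

```cpp
#include <cstdint>
#include <iostream>

// Returns how many bytes a UTF-8 sequence occupies, judging only by its lead byte,
// or 0 if the byte cannot start a valid sequence (e.g. a 10xxxxxx continuation byte).
int utf8_sequence_length(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid (FE/FF never appear)
}

int main() {
    // "A", "é", "€" encoded as UTF-8: sequences of 1, 2 and 3 bytes.
    const unsigned char sample[] = {0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC};
    for (std::size_t i = 0; i < sizeof(sample); ) {
        int len = utf8_sequence_length(sample[i]);
        std::cout << "sequence of " << len << " byte(s)\n";
        i += len > 0 ? len : 1;
    }
}
```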
The page also shows you a great comparison between the advantages and disadvantages of each character encoding type.
UTF-16 (often labelled UCS-2, though UCS-2 proper is a fixed two-byte encoding that cannot represent every Unicode character)
Uses 2 or 4 bytes for each symbol; the 4-byte cases are surrogate pairs for characters outside the Basic Multilingual Plane. The sketch below compares the code-unit counts.
UTF-32 (UCS-4)
Always uses 4 bytes for each symbol.
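A minimal C++11 sketch of those code-unit sizes, using an ASCII letter and a character outside the Basic Multilingual Plane (U+1F600); the sizes are computed with sizeof and the example is illustrative only:

```cpp
#include <iostream>

int main() {
    // 'A' (U+0041) fits in one UTF-16 code unit; U+1F600 needs a surrogate pair.
    const char16_t a16[]     = u"A";            // 1 code unit  -> 2 bytes
    const char16_t emoji16[] = u"\U0001F600";   // 2 code units -> 4 bytes
    const char32_t a32[]     = U"A";            // always 1 code unit -> 4 bytes
    const char32_t emoji32[] = U"\U0001F600";   // always 1 code unit -> 4 bytes

    // sizeof includes the terminating null code unit, so subtract it.
    std::cout << sizeof(a16)     - sizeof(char16_t) << ' '   // 2
              << sizeof(emoji16) - sizeof(char16_t) << ' '   // 4
              << sizeof(a32)     - sizeof(char32_t) << ' '   // 4
              << sizeof(emoji32) - sizeof(char32_t) << '\n'; // 4
}
```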
In C and C++, char just means a byte of storage and is not an actual encoding; it is not analogous to UTF-8, UTF-16, or ASCII. A char* pointer can refer to data in any encoding.
STL:
Neither the STL's std::string nor std::wstring is designed for variable-length encodings such as UTF-8 and UTF-16; both store fixed-size code units, so size() and indexing count code units rather than characters (see the sketch below).
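A small sketch of that pitfall, assuming the bytes held in the std::string are UTF-8: size() reports bytes, not the number of characters a user sees:

```cpp
#include <iostream>
#include <string>

int main() {
    // "café" in UTF-8: the final character is the two bytes 0xC3 0xA9.
    std::string s = "caf\xC3\xA9";
    std::cout << s.size() << '\n';   // prints 5 (bytes/code units), although a user sees 4 characters
}
```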
How to implement:
Take a look at the iconv library. iconv is a powerful character-encoding conversion library used by projects such as libxml (the XML C parser of GNOME).
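A minimal sketch of iconv usage, assuming a POSIX-like system where <iconv.h> is available (the buffer names and sizes are illustrative); it converts a short UTF-8 string to UTF-16LE:

```cpp
#include <iconv.h>
#include <cstdio>
#include <cstring>

int main() {
    char in[] = "caf\xC3\xA9";                 // UTF-8 input ("café")
    char out[64];                              // room for the converted bytes
    char *inp = in, *outp = out;
    size_t inleft = std::strlen(in), outleft = sizeof(out);

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   // arguments are (to, from)
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        std::perror("iconv");

    std::printf("converted %zu UTF-8 bytes into %zu UTF-16LE bytes\n",
                std::strlen(in), sizeof(out) - outleft);

    iconv_close(cd);
    return 0;
}
```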