UTF usage in C++ code

Question

What is the difference between UTF and UCS?

What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:

Internal representation inside the code
- For string manipulation at run-time
- For using the string for display purposes
Best storage representation (i.e. in a file)
Best on-wire transport format (transfer between applications that may be on different architectures and have a different standard locale)

John Millikin · Accepted Answer

What is the difference between UTF and UCS?

UCS encodings are fixed width, and are named for the number of bytes used for each character. For example, UCS-2 uses exactly 2 bytes per character. Characters with code points outside the available range (above U+FFFF for UCS-2) can't be encoded in that UCS encoding.

UTF encodings are variable width, and are named for the minimum number of bits needed to store a character (the size of one code unit). For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using more bytes: astral characters (those outside the basic multi-lingual plane) take 4 bytes in UTF-16.
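To make the fixed-width versus variable-width distinction concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. The function names `encode_utf16` and `fits_in_ucs2` are made up for illustration and are not part of any standard library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// BMP code points take one 16-bit unit; astral code points take two
// (a high surrogate followed by a low surrogate).
std::u16string encode_utf16(char32_t cp) {
    // Valid scalar values only: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low 10 bits
    }
    return out;
}

// Hypothetical helper: UCS-2 is fixed width, so anything above U+FFFF
// simply has no representation.
bool fits_in_ucs2(char32_t cp) {
    return cp <= 0xFFFF;
}
```

With this sketch, `encode_utf16(U'A')` yields one code unit, while `encode_utf16(0x1F600)` yields the pair `0xD83D 0xDE00`, and `fits_in_ucs2(0x1F600)` is false.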

Internal representation inside the code

Best storage representation (i.e. in a file)

Best on-wire transport format (transfer between applications that may be on different architectures and have a different standard locale)

For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate (UTF-7 for old mail servers, UTF-16 for poorly written text editors), but UTF-8 is the most common.
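As a rough illustration of why UTF-8 travels well between architectures: it is defined as a sequence of single bytes, so there is no byte order to agree on. The sketch below encodes one code point by hand; `encode_utf8` is a hypothetical helper written for this answer, not a standard function, and it assumes the input is a valid code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as 1 to 4 UTF-8 bytes.
// Assumes cp is a valid scalar value (at most U+10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        // ASCII is encoded as itself.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The resulting `std::string` can be written to a file or socket byte for byte. For example, U+20AC (the euro sign) becomes the three bytes `E2 82 AC` on every platform.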

The preferred internal representation depends on your platform. On Windows, it is UTF-16. On UNIX, it is UCS-4. Each has its good points:

- A UTF-16 string never uses more memory than the same string in UCS-4. If you store many large strings with characters primarily in the basic multi-lingual plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
- UCS-4 is easier to reason about. Because a character outside the BMP is stored in UTF-16 as a "surrogate pair" of two 16-bit code units, it can be challenging to correctly split or render a string. UCS-4 text does not have this issue. UCS-4 also acts much like ASCII text in "char" arrays, so existing text algorithms can be ported easily.
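Here is a small sketch of the splitting problem described above. In a `std::u32string` (used here as a stand-in for UCS-4) the number of characters is just `size()`, and any index is a safe place to cut. In a `std::u16string` you have to look for surrogate pairs first. The helper names are hypothetical, and the code assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points (not code units) in a UTF-16 string.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A high surrogate followed by a low surrogate is one code point.
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;
        }
        ++n;
    }
    return n;
}

// Return at most max_units code units from the front of s, backing up by
// one unit if the cut would otherwise land in the middle of a pair.
std::u16string safe_prefix(const std::u16string& s, std::size_t max_units) {
    if (max_units >= s.size()) return s;
    std::size_t end = max_units;
    if (end > 0 && is_high_surrogate(s[end - 1]) && is_low_surrogate(s[end])) {
        --end;
    }
    return s.substr(0, end);
}
```

For the string `u"a\U0001F600b"`, `size()` reports 4 code units but `count_code_points` reports 3, and a naive `substr(0, 2)` would leave a dangling high surrogate where `safe_prefix(s, 2)` returns just `u"a"`.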

Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems, because NUL (zero) bytes never appear in the middle of UTF-8 text, whereas they appear routinely in UTF-16 and UCS-4 text.
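The zero-byte point can be checked directly by looking at the raw memory of the same text in each representation. This is only a sketch: `count_zero_bytes` is a hypothetical helper, and the counts mentioned afterwards assume the usual 2-byte `char16_t` and 4-byte `char32_t`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count zero bytes in the raw storage of a string,
// whatever its code-unit type (char, char16_t, char32_t).
template <class Str>
std::size_t count_zero_bytes(const Str& s) {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t bytes = s.size() * sizeof(typename Str::value_type);
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < bytes; ++i) {
        if (p[i] == 0) ++zeros;
    }
    return zeros;
}
```

For the text "Hi", the UTF-8 `std::string` contains no zero bytes, the `std::u16string` contains 2, and the `std::u32string` contains 6, regardless of byte order. That is why C functions such as `strlen` keep working on UTF-8 data but stop early on the other two.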

Michael Burr · Answer

Have you read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"?


Tags:

c++

unicode

locale

utf

ucs

Martin York

