
UTF usage in C++ code

What is the difference between UTF and UCS?

What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:

  • Internal representation inside the code
    • For string manipulation at run-time
    • For using the string for display purposes.
  • Best storage representation (e.g. in a file)
  • Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)

Martin York asked Oct 14 '08 05:10


2 Answers

What is the difference between UTF and UCS?

UCS encodings are fixed width and are named for the number of bytes used per character. For example, UCS-2 uses 2 bytes per character, so code points above U+FFFF cannot be encoded in it. UCS-4 uses 4 bytes per character and covers the whole code space.

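UTF encodings are named for the size in bits of their code unit, which is the minimum needed to store a character. UTF-8 and UTF-16 are variable width. For example, UTF-16 requires at least 16 bits (2 bytes) per character, and characters with large code points take more: astral characters (those outside the Basic Multilingual Plane) use 4 bytes in UTF-16, encoded as a surrogate pair. UTF-32 is effectively fixed width and matches UCS-4 over the Unicode range.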
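To make the variable-width point concrete, here is a minimal sketch of encoding a single code point as UTF-16. The function name `encode_utf16` is hypothetical, not a standard library call, and the sketch assumes its input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid scalar value (at most U+10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into a surrogate pair.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this sketch, U+0041 ("A") comes out as one code unit, while U+1F600 comes out as the two code units 0xD83D 0xDE00.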

  • Internal representation inside the code
  • Best storage representation (e.g. in a file)
  • Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)

For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.
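As an illustration of what goes to disk or over the wire, the following is a rough sketch of UTF-8 encoding for one code point. `encode_utf8` is a made-up name for this example, and the sketch only checks the upper bound of its input; a production encoder would also reject surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point as a UTF-8 byte sequence.
// Assumes cp is at most U+10FFFF; surrogates are not rejected here.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // ASCII range: a single byte, identical to ASCII.
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Because the output is a plain byte sequence, it has no byte-order dependence, which is one reason UTF-8 suits transfer between different architectures.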

The preferred internal representation depends on your platform. On Windows it is UTF-16, where wchar_t is 16 bits. On most UNIX-like systems it is UCS-4, where wchar_t is typically 32 bits. Each has its good points:

  • UTF-16 strings never use more memory than a UCS-4 string. If you store many large strings with characters primarily in the Basic Multilingual Plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
  • UCS-4 is easier to reason about. Because a character outside the BMP is encoded in UTF-16 as a "surrogate pair" of two 16-bit code units, it can be challenging to correctly split or render a string. UCS-4 text does not have this issue at the code-point level, although a combining sequence can still span several code points. UCS-4 also acts much like ASCII text in "char" arrays, so existing text algorithms can be ported easily.
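The trade-off in the two points above shows up as soon as you count characters. The sketch below uses a hypothetical `count_code_points` helper: in a UTF-16 string the number of code units and the number of code points can differ, whereas in a `std::u32string` they are the same.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point;
// unpaired surrogates are counted individually rather than rejected.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = (u >= 0xD800 && u <= 0xDBFF);
        if (high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i; // skip the low surrogate of the pair
        }
        ++n;
    }
    assert(n <= s.size()); // never more code points than code units
    return n;
}
```

For the text "a", U+1F600, "b", the UTF-16 string holds four code units but three code points, so indexing or truncating by code unit can cut a pair in half. The same text as a `std::u32string` has a size of three.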

Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems, because zero (NUL) bytes do not appear in the middle of UTF-8 text unless the text itself contains U+0000. In UTF-16 or UCS-4 they appear routinely, for example in every ASCII character.
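A small sketch of the zero-byte point, assuming a made-up helper `has_zero_byte` that scans the raw storage of a string. Code that treats text as a NUL-terminated "char" array would stop at the first such byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: true if any byte in the raw storage of s is zero.
template <class String>
bool has_zero_byte(const String& s) {
    const std::size_t bytes = s.size() * sizeof(typename String::value_type);
    return std::memchr(s.data(), 0, bytes) != nullptr;
}

// Illustrative checks for the same ASCII text in three encodings.
void zero_byte_examples() {
    assert(!has_zero_byte(std::string("hello")));    // UTF-8: no zero bytes
    assert(has_zero_byte(std::u16string(u"hello"))); // UTF-16: one per ASCII char
    assert(has_zero_byte(std::u32string(U"hello"))); // UCS-4: three per ASCII char
}
```

This holds regardless of byte order, since each ASCII character widened to 16 or 32 bits has at least one all-zero byte.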

John Millikin answered Sep 30 '22 06:09


Have you read Joel Spolsky's article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"?

Michael Burr answered Sep 30 '22 04:09