
The proper way to handle Unicode with C++ in 2018?

Tags: c++, unicode

I have tried searching Stack Overflow for an answer to this, but the questions and answers I've found are around 10 years old, and I can't find a consensus on the subject given the changes and possible progress since then.

There are several libraries that I know of outside the standard library that are supposed to handle Unicode:

  • http://userguide.icu-project.org/
  • https://github.com/nemtrif/utfcpp
  • https://github.com/CaptainCrowbar/unicorn-lib

There are a few features of the standard library (wstring, codecvt_utf8) that were included, but people seem to be ambivalent about using them because they deal with UTF-16, which this site (UTF-8 Everywhere) says shouldn't be used, and many people online seem to agree with the premise.

The only thing I'm looking for is the ability to do 4 things with Unicode strings:

  1. Read a string into memory.
  2. Search the string with a regex using Unicode or ASCII, and concatenate or do text replacement/formatting on it with either ASCII + Unicode numbers or characters.
  3. Convert to ASCII + the Unicode number format for characters that don't fit in the ASCII range.
  4. Write a string to disk or send it wherever.

From what I can tell, ICU handles this and more. What I would like to know is whether there is a standard way of handling this on Linux, Windows, and macOS.

Thank you for your time.

Lfod asked May 30 '18 21:05



1 Answer

I will try to throw out some ideas here:

  • most C++ programs/programmers just assume that a text is an almost opaque sequence of bytes. UTF-8 is probably responsible for that, and it is no surprise that many comments boil down to: don't worry about Unicode, just process UTF-8 encoded strings

  • files only contain bytes. At some point, if you try to internally process true Unicode code points, you will have to serialize them to bytes -> here again UTF-8 wins the point

  • as soon as you go out of the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are particularly awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display style, more or less like the old "i, backspace, ^" overstrike sequence that was used with 1970s ASCII when one wanted to print î. That's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols.

    That simply means that up to 3 consecutive Unicode code points can represent one single glyph here (and even more in other cases, such as emoji joined with U+200D ZERO WIDTH JOINER). So the idea that one character is one char32_t is still an approximation
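To make the gap between bytes, code units, code points, and glyphs concrete, here is a minimal sketch using only the standard library. The helper names (count_code_points, count_utf16_units) and the sample constants are made up for this illustration, and the UTF-8 counter assumes its input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes the input is already valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Count the UTF-16 code units needed for a sequence of code points:
// one unit inside the BMP, two (a surrogate pair) outside it.
std::size_t count_utf16_units(const std::u32string& code_points) {
    std::size_t n = 0;
    for (char32_t cp : code_points) {
        n += (cp > 0xFFFF) ? 2 : 1;
    }
    return n;
}

// THUMBS UP SIGN (U+1F44D) followed by a skin tone modifier (U+1F3FD).
// Fonts that support emoji modifiers render the pair as a single glyph.
const std::string kThumbsUpUtf8 = "\xF0\x9F\x91\x8D\xF0\x9F\x8F\xBD";
const std::u32string kThumbsUpUtf32 = U"\U0001F44D\U0001F3FD";
```

For this sample, kThumbsUpUtf8.size() is 8 bytes, count_code_points(kThumbsUpUtf8) is 2, and count_utf16_units(kThumbsUpUtf32) is 4, while a reader sees one character. None of these numbers is a glyph count; grouping code points into user-perceived characters needs the Unicode segmentation rules, which is the kind of work a library such as ICU does.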

My conclusion is that Unicode is a complex thing, and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP (keeping in mind that std::wstring_convert and the <codecvt> facets are deprecated since C++17), but full support is far beyond that.
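For the simple end of that spectrum, the question's third requirement (ASCII plus Unicode numbers) can be sketched by hand without any library. The question does not say which escape format is wanted, so the \uXXXX / \UXXXXXXXX notation below is an assumption, and the function names decode_one and escape_non_ascii are invented for this example. The decoder is deliberately minimal: it rejects bad lead bytes and truncated sequences but does not check for overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Decode one code point from UTF-8 starting at index i, advancing i past it.
// Minimal on purpose: no checks for overlong forms or encoded surrogates.
char32_t decode_one(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) {
        return lead;  // plain ASCII
    }
    int extra = 0;    // number of continuation bytes expected
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    for (; extra > 0; --extra) {
        if (i >= s.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        const unsigned char next = static_cast<unsigned char>(s[i]);
        if ((next & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid UTF-8 continuation byte");
        }
        cp = (cp << 6) | (next & 0x3F);
        ++i;
    }
    return cp;
}

// Keep ASCII as-is; write every other code point as \uXXXX (BMP) or
// \UXXXXXXXX (outside the BMP). The escape format is an assumed one.
std::string escape_non_ascii(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        const char32_t cp = decode_one(utf8, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
            continue;
        }
        char buf[16];
        if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
        }
        out += buf;
    }
    return out;
}
```

With this sketch, the UTF-8 bytes for "naïve" come out as na\u00EFve, and U+1F44D comes out as \U0001F44D. It works per code point only: it knows nothing about normalization, combining sequences, or case folding, which is where a dedicated library becomes necessary.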


BTW: even other languages like Python that claim to have native Unicode support (which is IMHO far better than the current C++ one) often fail in some areas:

  • the tkinter GUI library cannot display any code point outside the BMP, even though IDLE, the standard Python tool, is built on it
  • different modules of the standard library are dedicated to Unicode in addition to the core language support (codecs and unicodedata), and other modules, like the emoji package, are available in the Python Package Index because the standard library does not meet all needs

So support for Unicode has been poor for more than 10 years, and I do not really expect things to get much better in the next 10 years...

Serge Ballesta answered Oct 08 '22 18:10