
C11 Unicode Support

Tags: c, unicode, c11

I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.

My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are an integer type of at least 16 or 32 bits respectively but the implication is that they are UTF-16 or UTF-32.

I also know that the standard defines a couple of functions but they did not include any *get or *put functions (like they did when they added in wchar.h in C99).

So I am wondering: what do they expect me to do with char16_t and char32_t?

John Vulconshinz asked Sep 29 '14 18:09



2 Answers

That's a good question with no apparent answer.

The uchar.h types and functions added in C11 (mbrtoc16, c16rtomb, mbrtoc32, c32rtomb) are largely useless. They only support conversions between the new types (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, and those mappings are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.
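To make the "roll your own" point concrete, here is a minimal sketch of a UTF-8 decoder that produces a char32_t. The name utf8_decode and its return convention are my own assumptions, not anything from the standard library, and it assumes your platform ships <uchar.h>. It tries to cover the cases people usually get wrong: overlong encodings, surrogate code points, values above U+10FFFF, and truncated input.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n bytes)
   into *out. Returns the number of bytes consumed, or 0 if the input is
   malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t n, char32_t *out)
{
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t len;      /* total length of the sequence in bytes */
    char32_t cp;     /* code point being assembled */
    char32_t min;    /* smallest value that legitimately needs len bytes */

    if (b0 < 0x80) {
        *out = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        len = 2; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        len = 4; cp = b0 & 0x07; min = 0x10000;
    } else {
        return 0;    /* stray continuation byte or invalid lead byte */
    }

    if (n < len)
        return 0;    /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min)
        return 0;    /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;    /* surrogates are not valid in UTF-8 */
    if (cp > 0x10FFFF)
        return 0;    /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

Returning 0 on any error keeps the caller's loop simple, but it leaves the resynchronization policy (skip one byte, emit U+FFFD, abort) up to you.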

Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
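As a rough illustration of processing such literals without touching the locale, here is a sketch with two hypothetical helpers (utf32_length and utf16_count_code_points are names I made up). It assumes the implementation defines __STDC_UTF_16__ and __STDC_UTF_32__, so that u"..." and U"..." literals really are UTF-16 and UTF-32.

```c
#include <stddef.h>
#include <uchar.h>

/* Number of char32_t units before the terminator. In UTF-32 each unit is
   one code point, so this is also the code point count. */
size_t utf32_length(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points in a NUL-terminated UTF-16 string. A high surrogate
   (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) counts as one
   code point; an unpaired surrogate is counted as one unit by itself. */
size_t utf16_count_code_points(const char16_t *s)
{
    size_t count = 0;
    while (*s != 0) {
        if (*s >= 0xD800 && *s <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;   /* surrogate pair */
        else
            s += 1;   /* BMP character or unpaired surrogate */
        count++;
    }
    return count;
}
```

Under those assumptions, utf16_count_code_points(u"a\U0001F600b") gives 3 even though the literal holds four char16_t units, and the result does not change with setlocale().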

R.. GitHub STOP HELPING ICE answered Oct 20 '22 20:10



Testing whether a UTF-16 or UTF-32 character in the ASCII range is one of the "usual" 10 digits, +, - or a "normal" white-space character is easy to do, as is converting '0'-'9' to a digit value. Given that, atoi_utf16/32() proceeds like atoi(): simply inspect one character at a time.
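A minimal sketch of the ASCII-only approach for UTF-32 might look like the following. The name atoi_utf32 is hypothetical, and the code assumes U'...' constants carry Unicode code point values (that is, __STDC_UTF_32__ is defined). Like atoi(), it makes no attempt to detect overflow.

```c
#include <uchar.h>

/* Hypothetical ASCII-only atoi() for NUL-terminated UTF-32 strings. */
int atoi_utf32(const char32_t *s)
{
    /* Skip the "normal" white-space: space plus \t \n \v \f \r, which are
       the contiguous code points 9..13. */
    while (*s == U' ' || (*s >= U'\t' && *s <= U'\r'))
        s++;

    /* Optional sign. */
    int negative = 0;
    if (*s == U'+' || *s == U'-') {
        negative = (*s == U'-');
        s++;
    }

    /* Accumulate the usual 10 digits; stop at the first other character.
       Overflow is not checked, as with atoi(). */
    int value = 0;
    while (*s >= U'0' && *s <= U'9') {
        value = value * 10 + (int)(*s - U'0');
        s++;
    }

    return negative ? -value : value;
}
```

A UTF-16 version can use the same loop with char16_t and u'...' constants, since every character it accepts is a single code unit below 0x80 and a surrogate can never be mistaken for one.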

Testing whether some other UTF-16/UTF-32 character is a digit or white-space is harder. Code would need an extended isspace() and isdigit(), which can be had by switching locales (setlocale()) if the needed locale is available. (Note: you will likely need to restore the locale when the function is done.)

Converting a character that passes isdigit() but is not one of the usual 10 to its value is problematic. Anyway, the C standard does not appear to allow isdigit() to accept anything other than the usual 10 digits.

Conversion steps:

  1. Set locale to a corresponding one for UTF-16/UTF-32.

  2. Use isspace() for white-space detection.

  3. Convert in a similar fashion for your_atof().

  4. Restore the locale.
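The save/switch/restore part of those steps could be sketched as below. This is only an illustration under some loud assumptions: is_space_in_locale is a name I invented, it uses iswspace() rather than isspace() because the input is wider than a char, and it assumes wchar_t holds Unicode code points (true where __STDC_ISO_10646__ is defined, not on platforms with a 16-bit wchar_t). Also note that setlocale() changes process-wide state, so this is not safe with other threads running.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if c is white-space in the named locale,
   0 if it is not, and -1 if the locale could not be set. */
int is_space_in_locale(char32_t c, const char *locale_name)
{
    /* The string returned by setlocale(LC_CTYPE, NULL) may be overwritten
       by a later setlocale() call, so copy it before switching. */
    const char *current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int result = -1;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        /* Assumption: a char32_t value can be passed as a wide character. */
        result = iswspace((wint_t)c) != 0;
        setlocale(LC_CTYPE, saved);   /* restore the previous locale */
    }

    free(saved);
    return result;
}
```

In a real conversion function you would switch the locale once around the whole parse rather than once per character.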

chux - Reinstate Monica answered Oct 20 '22 21:10
