I am writing some string conversion functions similar to atoi()
or strtoll()
. I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.
My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are an integer type of at least 16 or 32 bits respectively but the implication is that they are UTF-16 or UTF-32.
I also know that the standard defines a couple of functions but they did not include any *get or *put functions (like they did when they added in wchar.h
in C99).
So I am wondering: what do they expect me to do with char16_t and char32_t?
That's a good question with no apparent answer.
The uchar.h
types and functions added in C11 are largely useless. They only support conversions between the new type (char16_t
or char32_t
) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t
, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.
Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8
, u
, and U
, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
Testing if a UTF-16 or UTF-32 charter in the ASCII range is one of the "usual" 10 digits, +, - or a "normal" white-space is easy to do as well as convert '0'-'9'
to a digit. Given that, atoi_utf16/32()
proceeds like atoi()
. Simply inspect one character at a time.
Testing if some other UTF-16/UTF-32 is a digit or white-space - that is harder. Code would need an extended isspace(), isdigit()
which can be had be switching locales (setlocale()
) if the needed locale is available. (Note: likely need to restore locale when the function is done.
Converting a character that passes isdigit()
but is not one of the usual 10 to its value is problematic. Anyways, that appears to not even be allowed.
Conversion steps:
Set locale to a corresponding one for UTF-16/UTF-32.
Use isspace()
for white-space detection.
Convert is a similar fashion for your_atof()
.
Restore local.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With