 

Is 16-bit wchar_t formally valid for representing full Unicode?

In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where sometimes two such values (called a “surrogate pair”) are needed for a single Unicode code point, is invalid for representing Unicode.

It's certainly inconvenient and in conflict with the assumption of the C and C++ standard libraries (e.g. character classification) that each code point is represented as a single value, although the Unicode consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does.
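As a minimal illustration of that single-value assumption (a sketch; the non-BMP code point is arbitrary, and the point is only that the classification functions see one wchar_t-sized value at a time, so a code point split over a surrogate pair can never be handed to them whole):

    #include <cwctype>   // std::iswalpha

    int main()
    {
        // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) as a UTF-16 surrogate pair.
        // With a 16-bit wchar_t each half fits in one value, but iswalpha only
        // ever sees one value, so the code point as a whole can't be queried.
        wchar_t const pair[] = { 0xD835, 0xDC9C };
        bool const hi = std::iswalpha(pair[0]) != 0;   // lone high surrogate: meaningless
        bool const lo = std::iswalpha(pair[1]) != 0;   // lone low surrogate: meaningless
        return (hi && lo) ? 0 : 1;
    }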

And certainly it seems as if the original intent was to have one wchar_t value per code point, consistent with the assumptions of the C and C++ standard libraries. E.g. in the web page “ISO C Amendment 1 (MSE)” over at ³unix.org, about the 1995 amendment that brought the wide character facilities into the C standard, the authors maintain that

The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.

But as it turned out, the C and C++ standards seem not to talk about the largest supported character, but only about the largest extended character set among the supported locales: wchar_t must be large enough to represent every code point in the largest such extended character set – but not Unicode, when there is no Unicode locale.

C99 §7.17/2 (from the N869 draft):

[the wchar_t type] is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.

This is almost identical to the wording in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can be smallish indeed, down to a single byte with UTF-8 encoding (a nightmare possibility where e.g. no standard library character classification function would work outside of ASCII's A through Z, but hey). Possibly the following imposes a requirement to be wider than that:

C99 §7.1.1/4:

A wide character is a code value (a binary encoded integer) of an object of type wchar_t that corresponds to a member of the extended character set.

… since it refers to the extended character set, but that term seems not to be defined further anywhere.

And at least with Microsoft's C and C++ runtime there is no Unicode locale: with that implementation setlocale is restricted to character encodings that have at most 2 bytes per character:

MSDN ⁴documentation of setlocale:

The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.

So it seems that, contrary to what I thought I knew and contrary to my assertion, Windows' 16-bit wchar_t is formally OK, mainly thanks to Microsoft's ingenious lack of support for UTF-8 locales, or indeed any locale with more than 2 bytes per character. But is it really so: is a 16-bit wchar_t OK?


Links:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

Asked Sep 17 '16 by Cheers and hth. - Alf


1 Answer

wchar_t is not now and never was a Unicode character/code point. The C++ standard does not declare that a wide-string literal will contain Unicode characters. The C++ standard does not declare that a wide-character literal will contain a Unicode character. Indeed, the standard doesn't say anything about what wchar_t will contain.

wchar_t can be used with locale-aware APIs, but those are only relative to the implementation-defined encoding, not any particular Unicode encoding. The standard library functions that take these use their knowledge of the implementation's encoding to do their jobs.
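For instance, a conversion through the locale machinery looks roughly like this (a sketch; the byte sequence and the guess that the user's default locale might be UTF-8 are purely illustrative, and the resulting wchar_t values are whatever the implementation's wide encoding says they are):

    #include <clocale>   // std::setlocale
    #include <cstdlib>   // std::mbstowcs
    #include <cstdio>

    int main()
    {
        std::setlocale(LC_ALL, "");           // whatever the user's default locale is

        char const narrow[] = "\xC3\xA9";     // "é" if that locale happens to be UTF-8
        wchar_t wide[8] = {};
        std::size_t const n = std::mbstowcs(wide, narrow, 7);

        // The values written into wide[] are in the implementation-defined wide
        // encoding of the current locale, not necessarily any Unicode encoding.
        if (n != static_cast<std::size_t>(-1))
            std::printf("%zu wchar_t value(s), first is 0x%lX\n",
                        n, static_cast<unsigned long>(wide[0]));
    }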

So, is a 16-bit wchar_t legal? Yes; the standard does not require that wchar_t be sufficiently large to hold a Unicode codepoint.
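A simple way to see which case a given implementation falls into (a sketch; the check only asks whether the range of wchar_t covers every Unicode code point, and says nothing about which encoding the implementation actually uses):

    #include <cwchar>    // WCHAR_MAX
    #include <cstdio>

    int main()
    {
        // 0x10FFFF is the largest Unicode code point.  A 16-bit wchar_t
        // (e.g. Windows) fails this test; a 32-bit wchar_t (e.g. Linux) passes.
        if (WCHAR_MAX >= 0x10FFFF)
            std::puts("wchar_t can hold every Unicode code point as a single value");
        else
            std::puts("wchar_t cannot; a variable-width scheme such as UTF-16 is needed");
    }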

Is a string of wchar_t permitted to hold UTF-16 values (or variable width in general)? Well, you are permitted to make strings of wchar_t that store whatever you want (so long as it fits). So for the purposes of the standard, the question is whether standard-provided means for generating wchar_t characters and strings are permitted to use UTF-16.

Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char->wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.
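For example, the following sketch makes the non-1:1 mapping visible by explicitly requesting UTF-16 code units on the wide side (the code point is arbitrary, and std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17, though still shipped):

    #include <codecvt>   // std::codecvt_utf8_utf16 (deprecated in C++17)
    #include <locale>    // std::wstring_convert
    #include <string>
    #include <cstdio>

    int main()
    {
        // UTF-8 encoding of U+1F000 (MAHJONG TILE EAST WIND): 4 bytes, 1 code point.
        std::string const utf8 = "\xF0\x9F\x80\x80";

        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        std::wstring const wide = conv.from_bytes(utf8);

        // One code point in, two wchar_t values out (a surrogate pair): the
        // conversion is not a 1:1 character mapping, and nothing in the
        // standard says it has to be.
        std::printf("code points: 1, wchar_t values: %zu\n", wide.size());
    }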

If a compiler wishes to declare that the wide character set consists of the Basic Multilingual Plane of Unicode, then a literal like L'\U0001F000' will produce a single wchar_t. But the value is implementation-defined, per [lex.ccon]/2:

The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.

And of course, C++ doesn't allow using a surrogate code point as a c-char; \uD800 is a compile error.

Where things get murky in the standard is the treatment of strings that contain characters outside of the character set. The above text would suggest that implementations can do what they want. And yet, [lex.string]/16 says this:

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

I say this is murky because nothing says what the behavior should be if a c-char in a string literal is outside the range of the destination character set.

Windows compilers (both VS and GCC-on-Windows) do indeed cause L"\U0001F000" to have an array size of 3 (two surrogate code units plus a single NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside of the valid range for a character set?

I would say that this is a hole in the standard, rather than a deficiency in those compilers. The standard should make it clearer what the conversion behavior in this case ought to be.
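A small check that makes the difference visible (a sketch; the element counts in the comments describe the behavior reported above for typical toolchains, not a guarantee taken from the standard):

    #include <cstddef>

    // One code point outside the BMP, spelled as a universal-character-name.
    wchar_t const s[] = L"\U0001F000";

    // MSVC and GCC on Windows (16-bit wchar_t): 3 elements -- two UTF-16
    // surrogate code units plus the terminating L'\0'.
    // GCC/Clang on Linux (32-bit wchar_t): 2 elements.
    std::size_t const n = sizeof s / sizeof s[0];

    int main()
    {
        return static_cast<int>(n);
    }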


In any case, wchar_t is not an appropriate tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard doesn't require this, you cannot rely on it.

Now obviously, you can stick whatever will fit inside a wchar_t. So even on platforms where wchar_t is 32 bits, you could shove UTF-16 data into them, with each 16-bit code unit taking up 32 bits. But you couldn't pass such text to any API function that expects the wide character encoding unless you knew that this was the expected encoding for that platform.

Basically, never use wchar_t if you want to work with a Unicode encoding.
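If one element per code point is what you actually need, here is a sketch of the char32_t alternative (this just exercises the usual guarantee for U-prefixed literals rather than anything platform-specific):

    #include <cstddef>

    // A char32_t string literal holds one element per code point (its
    // ISO 10646 value), so this count is 2 on every conforming
    // implementation: one code point plus the terminating U'\0'.
    char32_t const s[] = U"\U0001F000";
    static_assert(sizeof s / sizeof s[0] == 2, "one code point + terminator");

    int main() {}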

Answered by Nicol Bolas