Does the C++ standard mandate an encoding for wchar_t?

Tags:

c++

c++11

unicode

wchar-t

Here are some excerpts from my copy of the 2014 draft standard N4140

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4. I don't like it much because if it's true, we have an important property of the language buried deep in a library description. I have tried to find a more direct statement of this property, but to no avail.

Another interpretation is that the wchar_t encoding is not required to be either UCS2 or UCS4, and on implementations where it isn't, codecvt_utf8 won't work for wchar_t. I don't like this interpretation much either, because if it's true, and neither the char nor the wchar_t native encoding is Unicode, there doesn't seem to be a way to portably convert between those native encodings and Unicode.

Which of the two interpretations is true? Is there another one which I overlooked?

Clarification 1. I'm not asking for general opinions about the suitability of wchar_t for software development, or about properties of wchar_t one can derive from elsewhere. I am interested in these two specific paragraphs of the standard. I'm trying to understand what these specific paragraphs entail or do not entail.

Clarification 2. If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem. It doesn't. It says what it says. It appears that if one uses std::codecvt_utf8<wchar_t>, one ends up with a bunch of wchar_t encoded as UCS2 or UCS4, regardless of the current global locale. (There is no way to specify a locale or any character conversion facet for codecvt_utf8.) So the question can be rephrased like this: is the conversion result directly usable with the current global locale (and/or with any possible locale) for output, wctype queries and so on? If not, what is it usable for? (If the second interpretation above is correct, the answer would seem to be "nothing".)

asked Aug 04 '16 14:08

n. 1.8e9-where's-my-share m.

2 Answers

wchar_t is just an integral type. It has a min value, a max value, etc.

Its size is not fixed by the standard.

If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of wchar_t. This is true regardless of the system you are on, as UCS-2 and UCS-4 and UTF-16 and UTF-32 are just descriptions of integer values arranged in a sequence.
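To see what "large enough" means on a given implementation, here is a minimal sketch that inspects wchar_t as a plain integer type. The helper names wchar_bits, fits_in_wchar and check_wchar_range are made up for illustration; only std::numeric_limits and CHAR_BIT come from the standard library.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on this implementation (not fixed by the standard).
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if the given code point value fits in a single wchar_t.
// This only checks the integer range; it says nothing about how any
// API will later interpret that value.
constexpr bool fits_in_wchar(char32_t code_point) {
    return static_cast<unsigned long long>(code_point) <=
           static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max());
}

// Hypothetical usage check: 'A' (U+0041) fits on any conforming implementation.
void check_wchar_range() {
    assert(wchar_bits() >= 8);
    assert(fits_in_wchar(U'A'));
}
```

Whether fits_in_wchar(0x10FFFF) holds depends on the platform: it is expected to be true where wchar_t is 32 bits wide and false where it is 16 bits wide.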

In C++11, there are std APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.

22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:

(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.

4 For the facet codecvt_utf8:

(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

So here codecvt_utf8 deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big Elem is) on the other. It performs the conversion.

The Elem (the wide character) is presumed to be encoded in UCS2 or UCS4 depending on how big it is.

This does not mean that wchar_t is encoded as such, it just means this operation interprets the wchar_t as being encoded as such.

How the UCS2 or UCS4 got into the Elem is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from io. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit-values of an ASCII string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. That is not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.
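As a rough illustration of the "hex constants" case, the sketch below puts a raw value into a wchar_t and lets std::codecvt_utf8<wchar_t> interpret it as UCS2/UCS4. The function names wide_to_utf8, utf8_to_wide and check_examples are hypothetical, and the sketch assumes wchar_t is at least 16 bits wide. Note that <codecvt> and std::wstring_convert are deprecated since C++17, so a recent compiler may warn about them.

```cpp
#include <cassert>
#include <codecvt>  // deprecated since C++17; used here because it is the facet under discussion
#include <locale>
#include <string>

// Encode wchar_t values as UTF-8. The facet treats each wchar_t as a
// UCS2/UCS4 value; the global locale is not consulted.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(wide);
}

// Decode UTF-8 into wchar_t values, again interpreted as UCS2/UCS4.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical usage check: 0x20AC is just an integer we chose to store.
// The facet reads it as U+20AC (EURO SIGN), whose UTF-8 form is E2 82 AC.
void check_examples() {
    const std::wstring euro(1, static_cast<wchar_t>(0x20AC));
    assert(wide_to_utf8(euro) == "\xE2\x82\xAC");
    assert(utf8_to_wide("\xE2\x82\xAC") == euro);
}
```

Nothing in this code touches the global locale, which matches the reading that the facet alone decides how the wchar_t values are interpreted.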

Similar claims hold in other cases. This does not mandate what format wchar_t has. It simply states how these facets interpret wchar_t or char16_t or char32_t or char8_t (reading or writing).

Other ways of interacting with wchar_t use different methods to mandate how the value of the wchar_t is interpreted.

iswalpha uses the (global) locale to interpret the wchar_t, for example. In some locales, the wchar_t may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color from out of space.
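A small sketch of that locale dependence, assuming the C library's std::setlocale controls what std::iswalpha sees. The helper is_alpha_in_locale and its return convention are invented for this example, and which locale names exist (beyond "C") is platform-specific.

```cpp
#include <cassert>
#include <clocale>
#include <cwctype>
#include <string>

// Classify one wchar_t value under a named C locale.
// Returns 1 if alphabetic, 0 if not, and -1 if the locale is unavailable.
// The previous LC_CTYPE setting is restored before returning.
int is_alpha_in_locale(wchar_t value, const char* locale_name) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (!std::setlocale(LC_CTYPE, locale_name)) {
        return -1;  // the requested locale is not installed or not recognised
    }
    const int result = std::iswalpha(static_cast<std::wint_t>(value)) != 0 ? 1 : 0;
    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}

// Hypothetical usage check, restricted to the always-available "C" locale.
void check_c_locale() {
    assert(is_alpha_in_locale(L'a', "C") == 1);
    assert(is_alpha_in_locale(L'1', "C") == 0);
}
```

The same integer value passed with a different locale name may be classified differently, because the locale, not the wchar_t itself, supplies the interpretation.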

To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.

The C++ standard does not mandate what is stored in a wchar_t. It does mandate what certain operations interpret the contents of a wchar_t to be. That section describes how some facets interpret the data in a wchar_t.

answered Sep 23 '22 22:09

Yakk - Adam Nevraumont

No.

wchar_t is only required to be able to represent the largest extended character set among the locales the implementation supports, which could theoretically fit in a char.

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

— C++ [basic.fundamental] 3.9.1/5

As such, it's not even required to support Unicode:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

ISO/IEC 10646:2003 Unicode standard 4.0

answered Sep 22 '22 22:09

Francesco Dondi
