Quite a few concepts related to character sets are mentioned in the standard: basic source character set, basic execution character set, basic execution wide-character set, execution character set, and execution wide-character set.
I don't have many questions about the basic source character set, basic execution character set, and basic execution wide-character set.
As for the execution character set, the standard says it's implementation-defined and locale-specific, so I tried to get a real sense of it by observing the byte contents of a char array initialized from a string literal. The values of the array elements should equal the numerical values of the encoded characters in the execution character set (and a universal-character-name may map to more than one char element due to multibyte encoding):
char str[] = "Greek lowercase alpha is: \u03B1.";
It seems to be almost always UTF-8 on Linux (the bytes CE B1 are stored in the array for that Greek letter). On Windows, it's Windows-1252 if the system locale is English (the substitute value 3F, i.e. '?', is stored, since the Greek letter is not available in Windows-1252), and some other encoding for other locales: e.g. A6 C1 in cp936 for a Chinese locale, or E1 in Windows-1253 for a Greek locale, representing Greek lowercase alpha in those two encodings respectively. In every case where the Greek letter is available in the locale (and thus available in the execution character set), cout << str; prints the Greek letter properly. All seems alright.
But I don't understand the execution wide-character set very well. What is its exact encoding on major platforms? The ISO-10646 value 0x3B1 of Greek lowercase alpha always gets stored in the wchar_t element for a declaration like the one below on all the platforms I tried:

wchar_t wstr[] = L"Greek lowercase alpha is: \u03B1.";

So I guess the execution wide-character set may well be UCS-2/UTF-16 or UTF-32 (different environments have different sizes for wchar_t: mostly 4 bytes on Linux and 2 on Windows)? However, wcout << wstr; doesn't print the Greek letter correctly on either Linux or Windows. Surely the members and encoding of the execution wide-character set are implementation-defined, but that shouldn't prevent the implementation-provided iostream facility from recognizing and handling them appropriately, right? (The execution character set is also implementation-defined, yet the iostream facility handles it fine.) What is the default interpretation of a wchar_t array when handled by iostream facilities? (Just to clarify, I'm more interested in the nature of the execution wide-character set than in finding a correct way to print a wide-character string on a particular platform.)
PS: I'm a total novice with wchar_t, so my apologies if I've said something very wrong.
The execution character set is the encoding used for the text of your program that results from the compilation phases after all preprocessing steps. This character set is used for the internal representation of any string or character literals in the compiled code.
What is character encoding? Character encoding tells computers how to interpret digital data as letters, numbers, and symbols. This is done by assigning a specific numeric value to each letter, number, or symbol; these are classified as "characters".
Character sets and encoding schemes: the distinction between the two isn't always clear, and the terms tend to be used interchangeably. A character set is a list of characters, whereas an encoding scheme is how they are represented in binary. This is best seen with Unicode.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
The execution wide-character set is simply the set of characters used to encode wchar_t values at runtime. See N3337 §2.3.
The encoding is implementation-defined. On all modern systems and platforms it would be Unicode (ISO-10646), but nothing in the standard requires that. On older platforms such as IBM mainframes it might be a DBCS or something else entirely. You may never encounter such a system, but that's what the standard allows.
The EWCS is required to have some specific members and conversions. It is required to work correctly with the library functions. These are not tough restrictions.
The wide characters could actually be 16-bit (a short int, as on Windows) or 32-bit (as on Unix) and still encode the same character set (Unicode).