Quite a few concepts related to character sets are mentioned in the standard: basic source character set, basic execution character set, basic execution wide-character set, execution character set, and execution wide-character set.
I don't have many questions about the basic source character set, basic execution character set, and basic execution wide-character set.
As for the execution character set, the standard says it's implementation-defined and locale-specific, so I tried to get a real sense of it by observing the byte contents of a char array initialized from a string literal. The values of the array elements should equal the numerical values of the encoded characters in the execution character set (and a universal-character-name may map to more than one char element due to multibyte encoding):
char str[] = "Greek lowercase alpha is: \u03B1.";
It seems to be almost always UTF-8 on Linux (the bytes CE B1 are stored in the array for that Greek letter). On Windows, it's Windows-1252 if the system locale is English (the substitute value 3F, i.e. '?', is stored, since the Greek letter is not available in Windows-1252), and some other encoding for other locales: e.g. A6 C1 in cp936 for a Chinese locale, or E1 in Windows-1253 for a Greek locale, representing Greek lowercase alpha in those two encodings respectively. In every case where the Greek letter is available in the locale (and thus available in the execution character set), cout << str; prints the Greek letter properly. All seems alright.
But I don't understand the execution wide-character set very well. What is its exact encoding on major platforms? The ISO-10646 value 0x3B1 of Greek lowercase alpha always gets stored in the wchar_t element for a declaration like the one below on all the platforms I tried:

wchar_t wstr[] = L"Greek lowercase alpha is: \u03B1.";

So I guess the execution wide-character set may well be UCS-2/UTF-16 or UTF-32 (different environments have different sizes for wchar_t: mostly 4 bytes on Linux and 2 on Windows)? However, wcout << wstr; doesn't print the Greek letter correctly on either Linux or Windows. Surely the members and encoding of the execution wide-character set are implementation-defined, but that shouldn't prevent the implementation-provided iostream facility from recognizing and handling them appropriately, right? (The execution character set is also implementation-defined, yet the iostream facility handles it fine.) What is the default interpretation of a wchar_t array when handled by iostream facilities? (Just to clarify, I'm more interested in the nature of the execution wide-character set than in finding a correct way to print a wide-character string on a particular platform.)
PS: I'm a total novice with wchar_t, so my apologies if I've said something very wrong.
The execution character set is the encoding used for the text of your program that results from the compilation phases after all preprocessing steps. This character set is used for the internal representation of any string or character literals in the compiled code.
What is character encoding? Character encoding tells computers how to interpret digital data as letters, numbers, and symbols. This is done by assigning a specific numeric value to each letter, number, or symbol; these are classified as "characters".
Character sets and encoding schemes: the distinction between the two isn't always clear, and the terms tend to be used interchangeably. A character set is a list of characters, whereas an encoding scheme is how they are represented in binary. This is best seen with Unicode.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
The execution wide-character set is simply the set of characters used to encode wchar_t values at runtime. See N3337 §2.3.
The encoding is implementation-defined. On all modern systems and platforms it would be Unicode (ISO-10646), but nothing in the standard requires that. On older platforms such as IBM mainframes it might be a DBCS or something else entirely. You may never encounter such a system, but that's what the standard allows.
The EWCS is required to have some specific members and conversions. It is required to work correctly with the library functions. These are not tough restrictions.
The wide characters could actually be 16-bit (a short int, as on Windows) or 32-bit (as on Unix) and still encode the same character set (Unicode).