What is execution wide-character set and its encoding?

Quite a few concepts related to character set are mentioned in the standard: basic source character set, basic execution character set, basic execution wide-character set, execution character set, and execution wide-character set:

  • Basic source character set: 91 graphical characters, plus space character, HT, VT, FF, LF (just borrowing name abbreviations from ASCII).
  • Basic execution (wide-)character set: all members of basic source character set, plus BEL, BS, CR, (wide-)NUL.
  • The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

I don't have many questions about the basic source character set, the basic execution character set, or the basic execution wide-character set.

As for the execution character set, the standard says it's implementation-defined and locale-specific, so I tried to get a concrete sense of it by observing the byte contents of a char array initialized from a string literal. The array's bytes should hold the numerical values of the encoded characters in the execution character set (and a universal-character-name may map to more than one char element due to multibyte encoding):

char str[] = "Greek lowercase alpha is: \u03B1.";

It seems to be almost always UTF-8 on Linux (CE B1 is stored in the array for that Greek letter). On Windows, it's Windows-1252 if the system locale is English (the substitute value 3F, i.e. ?, is stored, since Greek is not representable in Windows-1252), and some other encoding for other locales (e.g. A6 C1 in cp936 for a Chinese locale, or E1 in Windows-1253 for a Greek locale, each representing Greek lowercase alpha in that encoding). In every case where the Greek letter is available in the locale (and thus in the execution character set), cout << str; prints it correctly. So far, all seems fine.

But for execution wide-character set, I don't understand very well. What is its exact encoding on major platforms? It seems that the ISO-10646 value 0x3B1 of the Greek lowercase alpha always gets stored in the wchar_t for a declaration like the one below on all the platforms that I tried:

wchar_t wstr[] = L"Greek lowercase alpha is: \u03B1."; 

So I guess the execution wide-character set may well be UCS-2/UTF-16 or UTF-32 (different environments use different sizes for wchar_t: typically 4 bytes on Linux and 2 on Windows)? However, wcout << wstr; doesn't print the Greek letter correctly on either Linux or Windows. Surely the members and encoding of the execution wide-character set are implementation-defined, but that shouldn't prevent the implementation-provided iostream facilities from recognizing and handling them appropriately, right? (The execution character set is also implementation-defined, yet the iostream facilities handle it fine.) What is the default interpretation of a wchar_t array when handled by the iostream facilities? (To clarify: I'm more interested in the nature of the execution wide-character set than in finding the correct way to print a wide-character string on a particular platform.)

PS: I'm a total novice with wchar_t, so my apologies if I've said something very wrong.

asked Feb 26 '14 by goodbyeera


1 Answer

The execution wide-character set is simply the set of characters used to encode wchar_t at runtime. See N3337 §2.3.

The encoding is implementation-defined. On virtually all modern systems and platforms it is Unicode (ISO-10646), but nothing in the standard requires that. On older platforms such as IBM mainframes it might be a DBCS or something else entirely. You're unlikely to encounter that, but it's what the standard allows.

The EWCS is required to have some specific members and conversions. It is required to work correctly with the library functions. These are not tough restrictions.

The wide characters could be 16-bit (as on Windows, where wchar_t is 2 bytes) or 32-bit (as on Unix) and still represent the same character set (Unicode).

answered Sep 17 '22 by david.pfx