The type of a character literal is specified by the following rule:
A character literal that does not begin with u8, u, U, or L is an ordinary character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
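Consider the example below:
#include <iostream>
#include <typeinfo>  // required for typeid
int main() {
    auto c = '\u0080';
    // Prints the implementation's name for the deduced type of c.
    std::cout << typeid(c).name();
}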
The type of c is int (reported by GCC). Why is the type of c int?
According to the grammar of c-char, it's defined as:
c-char:
- any member of the source character set except the single-quote ', backslash \, or new-line character
- escape-sequence
- universal-character-name
In this example, \u0080 is a universal-character-name, which is a single c-char. So the ordinary character literal '\u0080' does not contain more than one c-char. The default execution character set of GCC is UTF-8. That means \u0080 is completely representable in the UTF-8 set. Why does GCC give c the type int? Although I know such a code point value cannot be represented by a char object, that is not what the above rule states. Is it a GCC bug, or am I misunderstanding something? How should "representable in the execution character set" be interpreted?
The default execution character set of GCC is UTF-8.
And therein lies the problem. Namely, this is not true. Or at least, not in the way that the C++ standard means it.
The standard defines the "basic character set" as a collection of 96 different characters. However, it does not define an encoding for them. That is, the character "A" is part of the "basic character set". But the value of that character is not specified.
When the standard defines the "basic execution character set", it adds some characters to the basic set, but it also defines that there is a mapping from a character to a value. Beyond requiring that the NUL character be 0 and that the digits be encoded in a contiguous sequence, however, it lets implementations decide for themselves what that mapping is.
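As a rough illustration of how little the standard pins down here, the following sketch checks the two guarantees just mentioned at compile time. The helper name digits_are_contiguous is made up for this example, and the final constant only reports what a particular implementation happens to do; it is not something the standard promises.

```cpp
// Guarantee 1: the null character has the value 0.
static_assert('\0' == 0, "the null character must have value 0");

// Guarantee 2: '0'..'9' are encoded as consecutive increasing values.
// (digits_are_contiguous is a hypothetical helper name for this sketch.)
constexpr bool digits_are_contiguous() {
    return '1' - '0' == 1 && '2' - '0' == 2 && '3' - '0' == 3 &&
           '4' - '0' == 4 && '5' - '0' == 5 && '6' - '0' == 6 &&
           '7' - '0' == 7 && '8' - '0' == 8 && '9' - '0' == 9;
}
static_assert(digits_are_contiguous(), "digits must be contiguous");

// No such guarantee exists for letters. This is true on ASCII-compatible
// implementations and false on, for example, EBCDIC ones.
constexpr bool letter_a_is_65 = ('A' == 65);
```

Everything except the last line should hold on any conforming implementation; the value of letter_a_is_65 is up to the implementation.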
Here's the issue: UTF-8 is not a "character set" by any reasonable definition of that term.
Unicode is a character set; it defines a series of characters which exist and what their meanings are. It also assigns each character in the Unicode character set a unique numeric value (a Unicode codepoint).
UTF-8 is... not that. UTF-8 is a scheme for encoding characters, typically in the Unicode character set (though it's not picky; it can work for any 21-bit number, and it can be extended to 32-bits).
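To make the distinction concrete, here is a minimal sketch of UTF-8 as a pure encoding step: it takes a number (a code point) and produces one to four code units, without knowing or caring which characters exist. The names Utf8Sequence and encode_utf8 are invented for this illustration, and the sketch assumes a C++14 or later compiler. It reports surrogates and values above 0x10FFFF by returning a length of 0.

```cpp
#include <cstddef>

// Hypothetical result type: up to four code units plus how many are used.
struct Utf8Sequence {
    unsigned char units[4];
    std::size_t length;  // 0 means "not encodable by this sketch"
};

// Encode one code point as UTF-8 code units.
constexpr Utf8Sequence encode_utf8(char32_t cp) {
    Utf8Sequence s{{0, 0, 0, 0}, 0};
    if (cp < 0x80) {                       // 1 code unit: 0xxxxxxx
        s.units[0] = static_cast<unsigned char>(cp);
        s.length = 1;
    } else if (cp < 0x800) {               // 2 code units: 110xxxxx 10xxxxxx
        s.units[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        s.units[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        s.length = 2;
    } else if (cp < 0x10000) {             // 3 code units
        if (cp >= 0xD800 && cp <= 0xDFFF) return s;  // surrogates rejected
        s.units[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        s.units[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        s.units[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        s.length = 3;
    } else if (cp <= 0x10FFFF) {           // 4 code units
        s.units[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        s.units[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        s.units[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        s.units[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        s.length = 4;
    }
    return s;
}

// U+0041 ('A') fits in one code unit; U+0080 already needs two.
static_assert(encode_utf8(0x41).length == 1, "");
static_assert(encode_utf8(0x80).length == 2, "");
```

Note that U+0080, the code point from the question, is the first one that no longer fits in a single UTF-8 code unit.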
So when GCC's documentation says:
[The execution character set] is under control of the user; the default is UTF-8, matching the source character set.
This statement makes no sense, since as previously stated, UTF-8 is a text encoding, not a character set.
What seems to have happened to GCC's documentation (and likely GCC's command line options) is that they've conflated the concept of "execution character set" with "narrow character encoding scheme". UTF-8 is how GCC encodes narrow character strings by default. But that's different from saying what its "execution character set" is.
That is, you can use UTF-8 to encode just the basic execution character set defined by C++. Using UTF-8 as your narrow character encoding scheme has no bearing on what your execution character set is.
Note that Visual Studio has a similarly-named option and makes a similar conflation of the two concepts. They call it the "execution character set", but they explain the behavior of the option as:
The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps.
So... what is GCC's execution character set? Well, since their documentation has confused "execution character set" with "narrow string encoding", it's pretty much impossible to know.
So what does the standard require out of GCC's behavior? Well, take the rule you quoted and turn it around. A single universal-character-name in a character literal will either be a char or an int, and it will only be the latter if the universal-character-name names a character not in the execution character set. So it's impossible for a system's execution character set to include more characters than a char has bits to represent.
That is, GCC's execution character set cannot be Unicode in its entirety. It must be some subset of Unicode. It can choose for it to be the subset of Unicode whose UTF-8 encoding takes up 1 char, but that's about as big as it can be.
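One way to observe this, assuming the narrow literal encoding is UTF-8 (the GCC default, and something other compilers or flags may change), is to count the char code units a string literal needs for a given character. The helper name code_units is made up for this sketch.

```cpp
#include <cstddef>

// Number of char code units in a narrow string literal, excluding the
// terminating NUL. (code_units is a hypothetical helper for this sketch.)
template <std::size_t N>
constexpr std::size_t code_units(const char (&)[N]) {
    return N - 1;
}

// 'A' occupies a single char in any encoding of the basic character set.
static_assert(code_units("A") == 1, "");

// Assumption: UTF-8 narrow encoding. U+0080 then occupies two chars, so it
// cannot be the value of a single char object.
static_assert(code_units("\u0080") == 2, "");
```

Under that assumption, '\u0080' falls outside the "one char" subset described above, which is consistent with GCC giving the literal the type int.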
While I've framed this as GCC's problem, it's also technically a problem in the C++ specification. The paragraph you quoted also conflates the encoding mechanism (i.e., what char means) with the execution character set (i.e., what characters are available to be stored).
This problem has been recognized and addressed by the addition of this wording:
A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit. A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char. The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent or L. Such character-literals are conditionally-supported.
As these are proposed (and accepted) as resolutions for CWG issues, they also retroactively apply to previous versions of the standard.