
Is '\u0B95' a multicharacter literal?

In a previous answer, I claimed that the following warning is caused by the fact that '\u0B95' requires three bytes and so is a multicharacter literal:

warning: multi-character character constant [-Wmultichar]

But actually, I don't think I'm right and I don't think gcc is either. The standard states:

An ordinary character literal that contains more than one c-char is a multicharacter literal.

One production rule for c-char is a universal-character-name (i.e. \uXXXX or \UXXXXXXXX). Since \u0B95 is a single c-char, this is not a multicharacter literal. But now it gets messy. The standard also says:
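The c-char counting rule can be checked at compile time. This is a minimal sketch, assuming a compiler that supports multicharacter literals (they are conditionally-supported; GCC and Clang accept them). The variable names are made up for illustration.

```cpp
#include <type_traits>

// 'a' contains one c-char, so it is an ordinary character literal of type char.
constexpr bool one_c_char_is_char =
    std::is_same<decltype('a'), char>::value;

// An escape sequence such as \n is also a single c-char.
constexpr bool escape_is_char =
    std::is_same<decltype('\n'), char>::value;

// 'ab' contains two c-chars, so it is a multicharacter literal of type int.
// Compilers typically warn here (GCC reports -Wmultichar).
constexpr bool two_c_chars_is_int =
    std::is_same<decltype('ab'), int>::value;

static_assert(one_c_char_is_char && escape_is_char && two_c_chars_is_int,
              "literal types follow the number of c-chars");
```

The sketch deliberately avoids '\u0B95' itself, because how a given compiler classifies that literal is exactly what is in dispute here.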

An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

So my literal has type char and value of the character in the execution character set (or implementation-defined value if it does not exist in that set). char is only defined to be large enough to store any member of the basic character set (which is not actually defined by the standard, but I assume it means the basic execution character set):

Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set.

Therefore, since the execution character set may contain characters whose encodings lie outside the range of values a char can hold, my character may not fit in a char.

So what value does my char have? This doesn't seem to be defined anywhere. The standard does say that for char16_t literals, if the value is not representable, the program is ill-formed. It says nothing about ordinary literals, though.
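For comparison, the prefixed literals are well specified. The sketch below assumes a C++11 or later compiler; the constant names are hypothetical.

```cpp
#include <type_traits>

// U+0B95 fits in a single 16-bit code unit, so the char16_t literal is
// well-formed and its value is the code point itself.
constexpr char16_t ka16 = u'\u0B95';
constexpr char32_t ka32 = U'\u0B95';

static_assert(std::is_same<decltype(u'\u0B95'), char16_t>::value,
              "u'' literals have type char16_t");
static_assert(ka16 == 0x0B95, "value equals the code point");
static_assert(ka32 == 0x0B95, "value equals the code point");

// A code point above U+FFFF needs a surrogate pair in UTF-16, so
// u'\U0001F600' would be ill-formed. The char32_t form is fine.
constexpr char32_t emoji32 = U'\U0001F600';
static_assert(emoji32 == 0x1F600, "value equals the code point");
```

The ordinary literal '\u0B95' is left out on purpose, since its type and value under a given compiler are the open question.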

So what's going on? Is this just a mess in the standard or am I missing something?

Joseph Mansfield asked Nov 25 '12 01:11



2 Answers

I would argue as follows:

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix)... (From section 2.14.3.4)

If '\u0B95' falls outside of the implementation-defined range defined for char (which it would if char is 8 bits), its value is then implementation-defined, at which point GCC can make its value a sequence of multiple c-chars, thus becoming a multicharacter literal.

Cornstalks answered Sep 29 '22 18:09


Somebody posted an answer that correctly answered the second part of my question (what value will the char have?) but has since deleted their post. Since that part was correct, I'll reproduce it here together with my answer for the first part (is it a multicharacter literal?).


'\u0B95' is not a multicharacter literal and gcc is mistaken here. As stated in the question, a multicharacter literal is defined by (§2.14.3/1):

An ordinary character literal that contains more than one c-char is a multicharacter literal.

Since a universal-character-name is one expansion of a c-char, the literal '\u0B95' contains only one c-char. If ordinary literals could not contain a universal-character-name, it would make sense for \u0B95 to be treated as six separate characters (\, u, 0, etc.), but I cannot find any such restriction. Therefore, it is a single character and the literal is not a multicharacter literal.

To further support this, why would it be considered to be multiple characters? At this point we haven't even given it an encoding so we don't know how many bytes it would take up. In UTF-16 it would take 2 bytes, in UTF-8 it would take 3 bytes and in some imagined encoding it could take just 1 byte.
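The UTF-8 and UTF-16 byte counts can be confirmed with string literals, whose encodings are fixed by their prefixes regardless of the execution character set. This is a small sketch; `code_units` is a helper invented for illustration.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// u8"" literals are UTF-8 (1-byte code units) and u"" literals are UTF-16
// (2-byte code units), so these counts are the same on every implementation.
constexpr std::size_t utf8_units  = code_units(u8"\u0B95");
constexpr std::size_t utf16_units = code_units(u"\u0B95");

static_assert(utf8_units == 3, "U+0B95 takes three bytes in UTF-8");
static_assert(utf16_units == 1, "U+0B95 takes one 16-bit unit in UTF-16");
```

The same character therefore occupies a different number of bytes depending on the encoding chosen, which is why the byte count cannot decide whether the literal is a single c-char.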

So what value will the character literal have? First the universal-character-name is mapped to the corresponding encoding in the execution character set, unless it has no mapping, in which case it has an implementation-defined encoding (§2.14.3/5):

A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding.

Either way, the char literal gets the value equal to the numerical value of the encoding (§2.14.3/1):

An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

Now the important part, inconveniently tucked away in a different paragraph further on in the section. If the value cannot be represented in a char, it gets an implementation-defined value (§2.14.3/4):

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for literals with no prefix) ...
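Whether a value falls outside the range of char can be checked against `CHAR_MIN` and `CHAR_MAX`. The sketch below uses the code point 0x0B95 as a stand-in for the encoded value, which is an assumption: the actual encoding depends on the execution character set. `fits_in_char` is a hypothetical helper.

```cpp
#include <climits>

// True if `value` lies within the implementation's range for char.
constexpr bool fits_in_char(long value) {
    return value >= CHAR_MIN && value <= CHAR_MAX;
}

// Members of the basic character set always fit.
static_assert(fits_in_char('a'), "basic characters fit in char");

// With an 8-bit char, 0x0B95 is out of range, so the literal's value
// would be implementation-defined under the quoted rule.
static_assert(CHAR_BIT != 8 || !fits_in_char(0x0B95),
              "0x0B95 does not fit in an 8-bit char");
```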

Joseph Mansfield answered Sep 29 '22 20:09