
😃 (and other Unicode characters) in identifiers not allowed by g++

I am 😞 to find that I cannot use 😃 as a valid identifier with g++ 4.7, even with the -fextended-identifiers option enabled:

    int main(int argc, const char* argv[])
    {
      const char* 😃 = "I'm very happy";
      return 0;
    }

main.cpp:3:3: error: stray ‘\360’ in program
main.cpp:3:3: error: stray ‘\237’ in program
main.cpp:3:3: error: stray ‘\230’ in program
main.cpp:3:3: error: stray ‘\203’ in program
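
The four stray bytes in these errors are the UTF-8 encoding of U+1F603, printed as octal escapes. A minimal sketch of that correspondence follows; the helper names `utf8_encode` and `as_octal_escapes` are made up for illustration, and the encoder assumes a valid, non-surrogate code point no larger than U+10FFFF.

```cpp
#include <cstdio>
#include <string>

// Encode one Unicode code point as UTF-8.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Render each byte as a backslash followed by three octal digits,
// the same style the diagnostic uses for stray bytes.
std::string as_octal_escapes(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        char buf[5];
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Calling `as_octal_escapes(utf8_encode(0x1F603))` yields `\360\237\230\203`, one escape per error line above.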

After some googling, I discovered that UTF-8 characters are not yet supported in identifiers, but a universal-character-name should work. So I convert my source to:

    int main(int argc, const char* argv[])
    {
      const char* \U0001F603 = "I'm very happy";
      return 0;
    }

main.cpp:3:15: error: universal character \U0001F603 is not valid in an identifier
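
The conversion from raw UTF-8 to universal-character-names can be automated. The sketch below is only an illustration: `to_ucn_source` is a hypothetical helper, it assumes the input is well-formed UTF-8, and it naively rewrites every non-ASCII sequence, including those inside string literals and comments.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Replace every non-ASCII UTF-8 sequence in `src` with a
// universal-character-name (\uXXXX or \UXXXXXXXX).
// Assumption: `src` is well-formed UTF-8; no validation is attempted.
std::string to_ucn_source(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size();) {
        unsigned char lead = static_cast<unsigned char>(src[i]);
        if (lead < 0x80) {  // plain ASCII: copy through unchanged
            out += src[i++];
            continue;
        }
        // Sequence length, derived from the high bits of the lead byte.
        int len = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : 2;
        unsigned long cp = lead & (0xFF >> (len + 1));
        for (int k = 1; k < len && i + k < src.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(src[i + k]) & 0x3F);
        i += len;

        char buf[11];  // "\U" + 8 hex digits + NUL
        if (cp <= 0xFFFF)
            std::snprintf(buf, sizeof buf, "\\u%04lX", cp);
        else
            std::snprintf(buf, sizeof buf, "\\U%08lX", cp);
        out += buf;
    }
    return out;
}
```

Feeding it the declaration from the first snippet produces the `\U0001F603` spelling used in the second one.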

So apparently 😃 isn't a valid identifier character. However, the standard specifically allows characters from the range 10000-1FFFD in Annex E.1 and doesn't disallow it as an initial character in E.2.
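
To make the Annex E reading concrete, here is a small checker. It is a sketch under assumptions: the range tables are transcribed from C++11 Annex E.1 and E.2 as I understand them and should be verified against the standard, the function names are hypothetical, and only extended characters are covered (ordinary letters, digits and the underscore are handled by the basic grammar).

```cpp
#include <cstddef>

struct Range { char32_t lo, hi; };

// E.1: ranges of characters allowed in identifiers (BMP part).
// Assumption: transcribed from C++11 Annex E.1; verify before relying on it.
static const Range kAllowed[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF}, {0x0100, 0x167F}, {0x1681, 0x180D},
    {0x180F, 0x1FFF}, {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040},
    {0x2054, 0x2054}, {0x2060, 0x206F}, {0x2070, 0x218F}, {0x2460, 0x24FF},
    {0x2776, 0x2793}, {0x2C00, 0x2DFF}, {0x2E80, 0x2FFF}, {0x3004, 0x3007},
    {0x3021, 0x302F}, {0x3031, 0x303F}, {0x3040, 0xD7FF}, {0xF900, 0xFD3D},
    {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

// E.2: ranges of characters disallowed as the first character.
static const Range kNotInitial[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F},
};

static bool in_table(const Range* table, std::size_t n, char32_t cp) {
    for (std::size_t i = 0; i < n; ++i)
        if (cp >= table[i].lo && cp <= table[i].hi) return true;
    return false;
}

bool allowed_in_identifier(char32_t cp) {
    // Planes 1 through 14: 10000-1FFFD, 20000-2FFFD, ..., E0000-EFFFD.
    if (cp >= 0x10000 && cp <= 0xEFFFD) return (cp & 0xFFFF) <= 0xFFFD;
    return in_table(kAllowed, sizeof kAllowed / sizeof kAllowed[0], cp);
}

bool allowed_as_first_char(char32_t cp) {
    return allowed_in_identifier(cp) &&
           !in_table(kNotInitial, sizeof kNotInitial / sizeof kNotInitial[0], cp);
}
```

With these tables, U+1F603 passes both checks, while a combining mark such as U+0300 is allowed only after the first character.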

My next effort was to see if any other allowed Unicode characters worked, but none that I tried did. Not even the ever-important PILE OF POO (💩) character.

So, for the sake of meaningful and descriptive variable names, what gives? Does -fextended-identifiers do as it advertises or not? Is it only supported in the very latest build? And what kind of support do other compilers have?

Joseph Mansfield asked Oct 02 '12 14:10



2 Answers

As of 4.8, gcc does not support characters outside the BMP in identifiers, which seems to be an unnecessary restriction. Also, gcc only supports a very restricted set of characters, described in ucnid.tab, based on C99 and C++98 (it seems it has not yet been updated to C11 and C++11).

As described in the manual, -fextended-identifiers is experimental, so there is a higher chance that it won't work as expected.


Edit:

GCC has supported the C11 character set since 4.9.0 (svn r204886, to be precise), so the OP's second piece of code, using \U0001F603, does work. However, I still can't get the original code using 😃 to work with GCC 8.2 on https://gcc.godbolt.org, even with -finput-charset=UTF-8 (you may want to follow this bug report, provided by @DanielWolf).

Meanwhile both pieces of code work on clang 3.3 without any options other than -std=c++11.

kennytm answered Nov 12 '22 02:11


This was a known bug in GCC 9 and earlier; it has been fixed in GCC 10.

The official changelog for GCC 10 contains this section:

Extended characters in identifiers may now be specified directly in the input encoding (UTF-8, by default), in addition to the UCN syntax (\uNNNN or \UNNNNNNNN) that is already supported:

    static const int π = 3;
    int get_naïve_pi() {
      return π;
    }
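
For compilers older than GCC 10, the same constant can be written with the UCN syntax the changelog mentions. This is a sketch, assuming a compiler with extended identifiers enabled (to my knowledge the default for C++ since GCC 5, and reported above to work on clang); `get_pi` is a name chosen for this example only.

```cpp
// U+03C0 is GREEK SMALL LETTER PI, spelled here as a universal-character-name
// so the source file itself stays pure ASCII.
static const int \u03C0 = 3;

int get_pi() {
    return \u03C0;
}
```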
Daniel Wolf answered Nov 12 '22 03:11