Unicode Identifiers and Source Code in C++11?

Tags:

syntax

c++11

unicode

I find in the new C++ Standard

2.11 Identifiers                  [lex.name]
identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit
identifier-nondigit:
    nondigit
    universal-character-name
    other implementation-defined character

with the additional text

An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. [...]

I cannot quite comprehend what this means. From the old standard, I am used to a "universal character name" being written as \u89ab, for example. But using those in an identifier...? Really?

Is the new standard more open w.r.t. Unicode? I am not referring to the new literal types such as "uHello \u89ab thing"u32; I think I understood those. But:

Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16 or any (however defined) codepage?
Can I write an identifier with \u1234 in it, such as myfu\u1234ntion (for whatever purpose)?
Or can I use the "character names" that Unicode defines, as in ICU, i.e.
```
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
```
or even in an identifier in the source itself? That would be a treat... cough...

I think the answer to all these questions is no, but I cannot map this reliably to the wording in the standard... :-)

Edit: I found "2.2 Phases of translation [lex.phases]", Phase 1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

Reading this, I now think that a compiler may choose to accept UTF-8, UTF-16 or any codepage it wishes (selected by meta information or user configuration). In Phase 1 it translates this into an ASCII form (the "basic source character set") in which the Unicode characters are replaced by their \uNNNN notation (or the compiler can choose to keep working in its own Unicode representation, but then it has to make sure it handles any explicitly written \uNNNN the same way).

What do you think?

729

asked Apr 15 '11 12:04

towi

5 Answers

Is the new standard more open w.r.t. Unicode?

With respect to allowing universal character names (UCNs) in identifiers, the answer is no: UCNs were already allowed in identifiers back in C99 and C++98. However, compilers did not implement that particular requirement until recently. I think Clang 3.3 introduces support for this, and GCC has had an experimental feature for it for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would be coming to VC++ at some point. (Although, IIRC, Herb refers to it as a C++11 feature, it is in fact a C++98 feature.)

It's not expected that identifiers will be written using UCNs. Instead the expected behavior is to write the desired character using the source encoding. E.g., source will look like:

long pörk;

not:

long p\u00F6rk;
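As a minimal sketch of the UCN spelling, assuming a compiler that implements UCNs in identifiers: the helper name `read_pork` is hypothetical and exists only so that the value can be checked at compile time.

```cpp
// The identifier below is spelled with a universal-character-name (UCN).
// \u00F6 designates U+00F6, so this declares the variable "pörk".
constexpr long p\u00F6rk = 42;

// Hypothetical helper: reads the variable back through the same spelling.
constexpr long read_pork() { return p\u00F6rk; }

// Compile-time check that the UCN-spelled name behaves like any other name.
static_assert(read_pork() == 42, "UCN-spelled identifier is an ordinary identifier");
```

On a compiler that also accepts the extended character directly in its source encoding, writing the letter literally should name the same variable.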

However, UCNs are also useful for another purpose. Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme where at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII-compatible encoding).

UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:

char const *degree_sign = "\u00b0";

This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
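The bytes of the narrow literal above depend on the execution encoding, so they cannot be checked portably. The following sketch instead uses `char16_t` and `char32_t` literals, whose values are code points fixed by C++11. The names `degree16`, `degree32` and `length` are hypothetical and only serve the illustration.

```cpp
#include <cstddef>

// Source text uses only basic characters, yet names extended characters.
constexpr char16_t degree16 = u'\u00b0';  // DEGREE SIGN as one UTF-16 code unit
constexpr char32_t degree32 = U'\u00b0';  // DEGREE SIGN as one UTF-32 code unit

// Hypothetical helper: counts code units up to the terminating zero.
constexpr std::size_t length(const char32_t* s) {
    return *s ? 1 + length(s + 1) : 0;
}

// The UCN denotes the code point itself, independent of the source encoding.
static_assert(degree16 == 0x00B0, "u'\\u00b0' is U+00B0");
static_assert(degree32 == 0x00B0, "U'\\u00b0' is U+00B0");

// One UCN contributes exactly one UTF-32 code unit to the string.
static_assert(length(U"Braunb\u00e4r") == 8, "eight code points");
```

Because the file contains nothing outside the basic character set, it should compile to the same values whether the compiler reads it as CP1252 or as UTF-8.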

Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?

It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.

(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)

Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)

Yes, the specification mandates this, although, as I said, not all compilers implement this requirement yet.

Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

No, you cannot use Unicode long names.

or even in an identifier in the source itself? That would be a treat... cough...

If the compiler supports a source code encoding that contains the extended character you want, then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that implements this requirement of the C++ spec, then you may write any character in its source character set directly in the source without bothering to write UCNs.
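One way to picture this equivalence is the phase 1 replacement quoted in the question. The sketch below is not how any particular compiler works. All function names are hypothetical, the input is assumed to be already decoded into code points, and every ASCII character is treated as basic, which is a simplification.

```cpp
#include <string>

// Simplification (an assumption of this sketch): every ASCII code point is
// treated as "basic". The real basic source character set is a slightly
// smaller subset of ASCII.
constexpr bool is_plain_ascii(char32_t c) { return c < 0x80; }

// Upper-case hex digit for the low four bits of v.
constexpr char hex_digit(unsigned v) { return "0123456789ABCDEF"[v & 0xFu]; }

// Spell a code point as UCN text: backslash, 'u' and 4 hex digits, or
// backslash, 'U' and 8 hex digits for code points above U+FFFF.
std::string to_ucn(char32_t c) {
    const bool short_form = c <= 0xFFFF;
    std::string out = short_form ? "\\u" : "\\U";
    for (int i = (short_form ? 4 : 8) - 1; i >= 0; --i)
        out += hex_digit(static_cast<unsigned>(c >> (4 * i)));
    return out;
}

// Hypothetical phase-1 style pass: basic characters are copied through,
// everything else is replaced by the UCN that designates it.
std::string phase1_sketch(const std::u32string& source) {
    std::string out;
    for (char32_t c : source)
        out += is_plain_ascii(c) ? std::string(1, static_cast<char>(c)) : to_ucn(c);
    return out;
}
```

Under these assumptions, `phase1_sketch(U"p\u00F6rk")` yields the nine characters `p\u00F6rk`, which is the same text a programmer would have typed using the UCN by hand.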

155

answered Oct 10 '22 20:10

bames53

I think the intent is to allow Unicode characters in identifiers, such as:

long pöjk;
ostream* å;

answered Oct 10 '22 20:10

dalle

I suggest using clang++ instead of g++. According to Wikipedia, Clang is designed to be highly compatible with GCC, so you can most likely just substitute the command.

I wanted to use Greek symbols in my source code. If code readability is the goal, then it seems reasonable to use (for example) α over alpha. Especially when used in larger mathematical formulas, they can be read more easily in the source code.

Here is a minimal working example:

> cat /tmp/test.cpp
#include <iostream>

int main()
{
    int α = 10;
    std::cout << "α = " << α << std::endl;
    return 0;
}
> clang++ /tmp/test.cpp -o /tmp/test
> /tmp/test 
α = 10

answered Oct 10 '22 21:10

Yeti

This article, https://www.securecoding.cert.org/confluence/display/seccode/PRE30-C.+Do+not+create+a+universal+character+name+through+concatenation, works on the premise that int \u0401; is compliant code, though it is based on C99 rather than C++0x.

answered Oct 10 '22 21:10

Mooing Duck

Present versions of gcc (up to version 5.2 so far) only support ASCII and, in some cases, EBCDIC input files. Therefore, Unicode characters in identifiers have to be represented using \uXXXX and \UXXXXXXXX escape sequences in ASCII-encoded files. While it may be possible to represent Unicode characters as ??/uXXXX and ??/UXXXXXXXX in EBCDIC-encoded input files, I have not tested this. At any rate, a simple one-line patch to cpp allows direct reading of UTF-8 input, provided a recent version of iconv is installed. Details are in

https://www.raspberrypi.org/forums/viewtopic.php?p=802657

and may be summarized by the patch

diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c  Mon Jan  5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c  Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
--- 1711,1717 ----
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, "C99", input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;

answered Oct 10 '22 20:10

ejolson
