
C11 & C++11 Extended and Universal Character Escaping

Context

C11 and C++11 both support extended characters in source files, as well as Universal Character Names (UCNs), which allow one to enter characters not in the basic source character set using only characters that are.
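
For instance, under the assumption of a UTF-8 source file and a UTF-8 terminal (an assumption about the environment, not a requirement of either Standard), both spellings below denote the same character:

#include <stdio.h>

int main(void){
        printf("é\n");      /* the extended character, written directly */
        printf("\u00e9\n"); /* the same character, written as a UCN */
        return 0;           /* both lines print: é */
}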

C++11 also defines several translation phases of compilation. In particular, extended characters are normalized to UCNs in the very first phase of translation, described below:

§ C++11 2.2p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)


Question

My question, therefore, is:

Does a Standard-conforming compilation of the program

#include <stdio.h>

int main(void){
        printf("\é\n");
        printf("\\u00e9\n");
        return 0;
}

fail, or compile and print

é
é

or compile and print

\u00e9
\u00e9

when run?


Informed Personal Opinion

It is my contention that the program compiles successfully and prints \u00e9 twice, since by §2.2p1.1 above we have

"An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal," and we are not in a raw string literal.

It then follows that

  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.
  • In Phase 3, "The source file is decomposed into preprocessing tokens" (§2.2p1.3), of which the string-literal "\\u00e9\n" is one.
  • In Phase 5, "Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set" (§2.2p1.5). Thus, by the maximal munch principle, \\ maps to \, and the fragment u00e9 is not recognized as a UCN and therefore prints as-is (a sketch of this follows the list).
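
The last step can be checked in isolation; the following sketch (assuming nothing beyond a hosted C11/C++11 implementation) shows that "\\u00e9" really is six ordinary execution characters and prints literally:

#include <stdio.h>
#include <string.h>

int main(void){
        /* The escaped backslash is one character, followed by the five
           ordinary characters u, 0, 0, e, 9 -- six in total, no UCN. */
        printf("%zu\n", strlen("\\u00e9")); /* prints: 6 */
        printf("\\u00e9\n");                /* prints: \u00e9 */
        return 0;
}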

Experiments

Unfortunately, extant compilers disagree with me. I've tested with both GCC 4.8.2 and Clang 3.5, and here is what they gave me:

  • GCC 4.8.2

    $ g++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    ucn.cpp: In function 'int main()':
    ucn.cpp:4:9: warning: unknown escape sequence: '\303' [enabled by default]
      printf("\é\n");
             ^
    $ ./ucn
    é
    \u00e9
    
  • Clang 3.5

    $ clang++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    ucn.cpp:4:10: warning: unknown escape sequence '\xFFFFFFC3' [-Wunknown-escape-sequence]
            printf("\é\n");
                    ^
    ucn.cpp:4:12: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
            printf("\é\n");
                     ^
    2 warnings generated.
    $ ./ucn
    é
    \u00e9
    

I have double- and triple-checked that the é character appears as C3 A9 using hexdump -C ucn.cpp, in agreement with the expected UTF-8 encoding. I've moreover verified that a plain printf("é\n"); or printf("\u00e9\n"); works flawlessly, so this is not a problem of the compilers tested being unable to read UTF-8 source files.

Who's right?

Asked May 10 '15 by Iwillnotexist Idonotexist


2 Answers

'é' is not a valid character to backslash escape in a string literal, and so a backslash followed by 'é' as either a literal source character or a UCN should produce a compiler diagnostic and undefined behavior.

Note, however, that "\\u00e9" is not a UCN preceded by a backslash, and that it's not possible to write any sequence of basic source characters in a string or character literal that is a backslash followed by a UCN. Thus "\é" and "\\u00e9" are not required to behave the same: The behavior of "\\u00e9" can be perfectly well defined while the behavior of "\é" is undefined.
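
A small sketch of that distinction (it assumes the compiler reads the file as UTF-8, as in the question): the first comparison is required to hold by §2.2p1.1, while the second literal is just ordinary characters:

#include <stdio.h>
#include <string.h>

int main(void){
        /* The extended character and its UCN spelling are handled
           equivalently, so these literals are the same string... */
        printf("%d\n", strcmp("é", "\u00e9") == 0);      /* prints: 1 */
        /* ...whereas an escaped backslash followed by u00e9 is a plain
           six-character string, not a backslash followed by a UCN. */
        printf("%d\n", strcmp("\u00e9", "\\u00e9") == 0); /* prints: 0 */
        return 0;
}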

If we were to posit some syntax that allowed backslash escaping a UCN, say "\«\u00e9»", then that would have undefined behavior like "\é".


  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.

The phase one conversion of é into a UCN cannot create a non-UCN, such as "\\u00e9".


The compilers are right, but don't specifically handle this situation with perfect diagnostic messages. Ideally what you'd get is:

$ clang++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
ucn.cpp:4:10: warning: unknown escape sequence '\é' [-Wunknown-escape-sequence]
        printf("\é\n");
                ^
    1 warning generated.
$ ./ucn
é
\u00e9

Both compilers specify that their behavior in the presence of an unknown escape sequence is to replace the escape sequence with the character thus escaped, so "\é" would be treated as "é" and the program overall should be interpreted as:

#include <stdio.h>

int main(void){
        printf("é\n");
        printf("\\u00e9\n");
        return 0;
}

Both compilers do happen to get this behavior, partly by chance, but also partly because their policy of treating unrecognized escape sequences this way is a smart choice: even though they see the unrecognized escape sequence only as a backslash followed by the byte 0xC3, they remove the backslash and leave the 0xC3 in place, which means the UTF-8 sequence is left intact for later processing.
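
A quick way to see why that works (assuming a UTF-8 source and execution character set, as in the question): é occupies the two bytes C3 A9, so removing only the backslash from "\é" leaves a complete UTF-8 sequence behind.

#include <stdio.h>

int main(void){
        /* Dump the bytes of "é"; under UTF-8 this prints: C3 A9 */
        const unsigned char *p = (const unsigned char *)"é";
        for (; *p; ++p)
                printf("%02X ", *p);
        printf("\n");
        return 0;
}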

Answered by bames53

You seem to be confused, thinking that \\u00e9 is a UCN -- it is not. UCNs all begin with \u, and in your case you have an extra backslash, which escapes this initial backslash. So \\u00e9 is the sequence of 6 characters: \, u, 0, 0, e, 9.

edit

  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.

This is where you're going wrong -- Phase 1 translates input characters into source characters, so printf("\é\n"); maps to p r i n t f ( " \ é \ n " ) ;, which is the same as p r i n t f ( " \ \u00e9 \ n " ) ;, but that's not the same as what printf("\\u00e9\n"); maps to because of the double-backslash in the latter. Because of the special handling of double-backslash, there is no way to have a backslash followed by a UCN in the source.
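
To make that concrete, a minimal sketch (again assuming a UTF-8 build environment) that spells out "\\u00e9" one execution character at a time, matching the six-character sequence given at the top of this answer:

#include <stdio.h>

int main(void){
        /* Prints the characters of the literal separated by spaces:
           \ u 0 0 e 9 */
        for (const char *p = "\\u00e9"; *p; ++p)
                printf("%c ", *p);
        printf("\n");
        return 0;
}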

Answered by Chris Dodd