
C11 & C++11 Extended and Universal Character Escaping

Context

C11 and C++11 both support extended characters in source files, as well as Universal Character Names (UCNs), which allow one to enter characters not in the basic source character set using only characters that are.
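
For instance, under the assumption of a UTF-8 source file and a UTF-8 terminal (an assumption about the environment, not a requirement of either Standard), both spellings below denote the same character:

#include <stdio.h>

int main(void){
        printf("é\n");      /* the extended character, written directly */
        printf("\u00e9\n"); /* the same character, written as a UCN */
        return 0;           /* both lines print: é */
}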

C++11 also defines several translation phases of compilation. In particular, extended characters are normalized to UCNs in the very first phase of translation, described below:

§ C++11 2.2p1.1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)


Question

My question, therefore, is:

Does a Standard-conforming compilation of the program

#include <stdio.h>

int main(void){
        printf("\é\n");
        printf("\\u00e9\n");
        return 0;
}

fail, or compile and print

é
é

or compile and print

\u00e9
\u00e9

when run?


Informed Personal Opinion

It is my contention that the program compiles successfully and prints \u00e9 twice, since by §2.2p1.1 above we have

"An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal," and we are not in a raw string literal.

It then follows that

  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.
  • In Phase 3, "The source file is decomposed into preprocessing tokens" (§2.2p1.3), of which the string-literal "\\u00e9\n" is one.
  • In Phase 5, "Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set" (§2.2p1.5). Thus, by the maximal munch principle, \\ maps to \, and the fragment u00e9 is not recognized as a UCN and therefore prints as-is (a sketch of this follows the list).
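
The last step can be checked in isolation; the following sketch (assuming nothing beyond a hosted C11/C++11 implementation) shows that "\\u00e9" really is six ordinary execution characters and prints literally:

#include <stdio.h>
#include <string.h>

int main(void){
        /* The escaped backslash is one character, followed by the five
           ordinary characters u, 0, 0, e, 9 -- six in total, no UCN. */
        printf("%zu\n", strlen("\\u00e9")); /* prints: 6 */
        printf("\\u00e9\n");                /* prints: \u00e9 */
        return 0;
}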

Experiments

Unfortunately, extant compilers disagree with me. I've tested with both GCC 4.8.2 and Clang 3.5, and here is what they gave me:

  • GCC 4.8.2

    $ g++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    ucn.cpp: In function 'int main()':
    ucn.cpp:4:9: warning: unknown escape sequence: '\303' [enabled by default]
      printf("\é\n");
             ^
    $ ./ucn
    é
    \u00e9
    
  • Clang 3.5

    $ clang++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
    ucn.cpp:4:10: warning: unknown escape sequence '\xFFFFFFC3' [-Wunknown-escape-sequence]
            printf("\é\n");
                    ^
    ucn.cpp:4:12: warning: illegal character encoding in string literal [-Winvalid-source-encoding]
            printf("\é\n");
                     ^
    2 warnings generated.
    $ ./ucn
    é
    \u00e9
    

I have double- and triple-checked that the é character appears as C3 A9 using hexdump -C ucn.cpp, in agreement with the expected UTF-8 encoding. I've moreover verified that a plain printf("é\n"); or printf("\u00e9\n"); works flawlessly, so this is not a problem of the compilers tested being unable to read UTF-8 source files.

Who's right?

Asked May 10 '15 by Iwillnotexist Idonotexist


2 Answers

'é' is not a valid character to backslash escape in a string literal, and so a backslash followed by 'é' as either a literal source character or a UCN should produce a compiler diagnostic and undefined behavior.

Note, however, that "\\u00e9" is not a UCN preceded by a backslash, and that it's not possible to write any sequence of basic source characters in a string or character literal that is a backslash followed by a UCN. Thus "\é" and "\\u00e9" are not required to behave the same: The behavior of "\\u00e9" can be perfectly well defined while the behavior of "\é" is undefined.
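
A small sketch of that distinction (it assumes the compiler reads the file as UTF-8, as in the question): the first comparison is required to hold by §2.2p1.1, while the second literal is just ordinary characters:

#include <stdio.h>
#include <string.h>

int main(void){
        /* The extended character and its UCN spelling are handled
           equivalently, so these literals are the same string... */
        printf("%d\n", strcmp("é", "\u00e9") == 0);      /* prints: 1 */
        /* ...whereas an escaped backslash followed by u00e9 is a plain
           six-character string, not a backslash followed by a UCN. */
        printf("%d\n", strcmp("\u00e9", "\\u00e9") == 0); /* prints: 0 */
        return 0;
}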

If we were to posit some syntax that allowed backslash escaping a UCN, say "\«\u00e9»", then that would have undefined behavior like "\é".


  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.

The phase one conversion of é into a UCN cannot create a non-UCN, such as "\\u00e9".


The compilers are right, but don't specifically handle this situation with perfect diagnostic messages. Ideally what you'd get is:

$ clang++ -std=c++11  -Wall -Wextra ucn.cpp -o ucn
ucn.cpp:4:10: warning: unknown escape sequence '\é' [-Wunknown-escape-sequence]
        printf("\é\n");
                ^
    1 warning generated.
$ ./ucn
é
\u00e9

Both compilers specify that their behavior in the presence of an unknown escape sequence is to replace the escape sequence with the character thus escaped, so "\é" would be treated as "é" and the program overall should be interpreted as:

#include <stdio.h>

int main(void){
        printf("é\n");
        printf("\\u00e9\n");
        return 0;
}

Both compilers do happen to get this behavior, partly by chance, but also partly because their policy of treating unrecognized escape sequences this way is a smart choice: even though they see the unrecognized escape sequence only as a backslash followed by the byte 0xC3, they remove the backslash and leave the 0xC3 in place, which means the UTF-8 sequence is left intact for later processing.
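
A quick way to see why that works (assuming a UTF-8 source and execution character set, as in the question): é occupies the two bytes C3 A9, so removing only the backslash from "\é" leaves a complete UTF-8 sequence behind.

#include <stdio.h>

int main(void){
        /* Dump the bytes of "é"; under UTF-8 this prints: C3 A9 */
        const unsigned char *p = (const unsigned char *)"é";
        for (; *p; ++p)
                printf("%02X ", *p);
        printf("\n");
        return 0;
}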

Answered by bames53

You seem to be confused, thinking that \\u00e9 is a UCN -- it is not. UCNs all begin with \u, and in your case you have an extra backslash, which escapes this initial backslash. So \\u00e9 is the sequence of 6 characters: \, u, 0, 0, e, 9.

edit

  • In Phase 1, printf("\é\n"); maps to printf("\\u00e9\n");.

This is where you're going wrong -- Phase 1 translates input characters into source characters, so printf("\é\n"); maps to p r i n t f ( " \ é \ n " ) ;, which is the same as p r i n t f ( " \ \u00e9 \ n " ) ;, but that's not the same as what printf("\\u00e9\n"); maps to because of the double-backslash in the latter. Because of the special handling of double-backslash, there is no way to have a backslash followed by a UCN in the source.
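
To make that concrete, a minimal sketch (again assuming a UTF-8 build environment) that spells out "\\u00e9" one execution character at a time, matching the six-character sequence given at the top of this answer:

#include <stdio.h>

int main(void){
        /* Prints the characters of the literal separated by spaces:
           \ u 0 0 e 9 */
        for (const char *p = "\\u00e9"; *p; ++p)
                printf("%c ", *p);
        printf("\n");
        return 0;
}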

Answered by Chris Dodd