Why does stringizing a euro sign within a string literal using UTF-8 not produce a UCN?

The spec says that at phase 1 of compilation

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

And at phase 4 it says

Preprocessing directives are executed, macro invocations are expanded

At phase 5, we have

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set

For the # operator, we have

a \ character is inserted before each " and \ character of a character literal or string literal (including the delimiting " characters).

Hence I conducted the following test

#define GET_UCN(X) #X
GET_UCN("€")

With an input character set of UTF-8 (matching my file's encoding), I expected the following preprocessing result of the #X operation: "\"\\u20AC\"". GCC, Clang and boost.wave don't transform the € into a UCN and instead yield "\"€\"". I feel like I'm missing something. Can you please explain?
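
A minimal self-contained way to inspect what a given compiler actually produces (a sketch only, assuming the source file is saved as UTF-8):

#include <cstdio>

#define GET_UCN(X) #X

int main() {
    // Under the phase-1 reading above this would print  "\u20AC"  (quotes included);
    // GCC, Clang and boost.wave instead print  "€" .
    std::puts(GET_UCN("€"));
}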

asked Jun 24 '11 by Johannes Schaub - litb



1 Answer

It's simply a bug. §2.1/1 says about Phase 1,

(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

This is not a note or footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand if you have one.
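
A rough sketch of that raw-string exception (assuming a C++11 compiler and a UTF-8 source file; the variable names are just for illustration):

// In C++11, a universal-character-name produced by phase 1 inside a raw string
// literal is reverted in phase 3, so the source spelling is preserved verbatim.
const char *raw = R"(€)";      // keeps the euro character exactly as typed
const char *ucn = "\u20AC";    // an explicit UCN, converted in phase 5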

This program clearly demonstrates the malfunction:

#include <iostream>

// Paste L onto the stringized argument to form a wide string literal.
#define GET_UCN(X) L ## #X

int main() {
    // Per the equivalence quoted above, both lines must print the same thing.
    std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
}

http://ideone.com/lb9jc

Because both strings are wide, the first is required to be corrupted into several characters if the compiler fails to interpret the input multibyte sequence. In your given example, total lack of support for UTF-8 could cause the compiler to slavishly echo the sequence right through.
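
The same point can be turned into a pass/fail check; the sketch below is only illustrative and again assumes a UTF-8 encoded source file:

#include <cstdio>
#include <cwchar>

#define GET_UCN(X) L ## #X

int main() {
    // If € and \u20AC are handled equivalently, as §2.1/1 requires, the two
    // stringized wide literals compare equal; a compiler that echoes the raw
    // UTF-8 bytes through for the first one makes them differ.
    std::puts(std::wcscmp(GET_UCN("€"), GET_UCN("\u20AC")) == 0
                  ? "equivalent"
                  : "different");
}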

answered by Potatoswatter