
How does file encoding affect C++11 string literals?

You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

I understand you can still escape single Unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one non-ASCII character.

What I understand from various sources is that u should become equivalent to L on current Windows implementations and U on e.g. Linux implementations. So with that in mind, I'm also wondering what the required behavior is for the old string literal modifiers...

For the code-sample monkeys:

std::string    a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";

In an ideal world, all of these strings produce the same content (as in: characters after conversion), but my experience with C++ has taught me that this is most definitely implementation defined and probably only the first will do what I want.
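
One quick way to see what a given implementation actually does (my sketch, not part of the question; it assumes a C++11 compiler and that the file really is saved and read as UTF-8) is to count the code units each prefix produces for the same text:

// Count code units per prefix for the same text ("Ça" = U+00C7 U+0061).
// The counts differ (UTF-8 vs UTF-16 vs UTF-32), but all three literals
// should decode to the same sequence of code points.
#include <iostream>

int main() {
    std::cout << sizeof(u8"Ça") - 1 << '\n';                    // 3 UTF-8 bytes
    std::cout << sizeof(u"Ça") / sizeof(char16_t) - 1 << '\n';  // 2 UTF-16 units
    std::cout << sizeof(U"Ça") / sizeof(char32_t) - 1 << '\n';  // 2 UTF-32 units
}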

asked Jul 22 '11 by rubenvb



2 Answers

In GCC, use -finput-charset=charset:

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.

Also check out the options -fexec-charset and -fwide-exec-charset.
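
As a rough illustration (mine, not from the answer), dumping the bytes the compiler stored for a narrow literal shows what the execution character set produced; this assumes the file is saved as UTF-8 and compiled with something like g++ -finput-charset=UTF-8 -fexec-charset=UTF-8:

// Print the raw bytes of a narrow literal. With a UTF-8 execution charset,
// "ô" should come out as C3 B4 (the trailing 00 is the terminator);
// with a different -fexec-charset the bytes may differ.
#include <cstdio>

int main() {
    const char text[] = "ô";
    for (unsigned char byte : text) {
        std::printf("%02X ", byte);
    }
    std::printf("\n");
}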

Finally, about string literals:

char     a[] = "Hello";
wchar_t  b[] = L"Hello";
char16_t c[] = u"Hello";
char32_t d[] = U"Hello";

The encoding prefix of the string literal (L, u, U) merely determines the type of the literal.
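
A small sketch (mine, not part of the answer) that makes the type claim concrete with compile-time checks:

// Each prefix fixes the element type of the literal's array.
#include <type_traits>

static_assert(std::is_same<decltype("x"),   const char(&)[2]>::value,     "narrow");
static_assert(std::is_same<decltype(u8"x"), const char(&)[2]>::value,     "UTF-8 (plain char in C++11)");
static_assert(std::is_same<decltype(L"x"),  const wchar_t(&)[2]>::value,  "wide");
static_assert(std::is_same<decltype(u"x"),  const char16_t(&)[2]>::value, "UTF-16");
static_assert(std::is_same<decltype(U"x"),  const char32_t(&)[2]>::value, "UTF-32");

int main() {}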

answered by Kerrek SB

How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals. I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

From n3290, 2.2 Phases of translation [lex.phases]

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [Here's a bit about trigraphs.] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...]

The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

The basic source set is a simple list of allowed characters. It is not ASCII (see further). Anything not in this list is 'transformed' (conceptually at least) to a \uXXXX form.

So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic source character set plus a bunch of \uXXXX escapes. I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so your reasoning in terms of ASCII is flawed from the get-go.

Finally, the process I described happens to (non-raw) string literals too. That means your code is treated as if you had written:

std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
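
To check that the two spellings really are treated equivalently on a given implementation, here is a minimal sketch (mine, assuming the source file is actually read as UTF-8):

// The extended character and its universal-character-name spelling
// should yield identical arrays of UTF-8 code units.
#include <cassert>
#include <cstring>

int main() {
    static_assert(sizeof(u8"h\u00F4tel") == sizeof(u8"hôtel"), "same length");
    assert(std::strcmp(u8"h\u00F4tel", u8"hôtel") == 0);  // same bytes
}
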
answered by Luc Danton