
How does file encoding affect C++11 string literals?

You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

I understand you can still escape single Unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one non-ASCII character.

What I understand from various sources is that u should become equivalent to L on current Windows implementations and U on e.g. Linux implementations. So with that in mind, I'm also wondering what the required behavior is for the old string literal modifiers...

For the code-sample monkeys:

std::string    a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";

In an ideal world, all of these strings produce the same content (as in: characters after conversion), but my experience with C++ has taught me that this is most definitely implementation defined and probably only the first will do what I want.
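
One quick way to see what a given implementation actually does (my sketch, not part of the question; it assumes a C++11 compiler and that the file really is saved and read as UTF-8) is to count the code units each prefix produces for the same text:

// Count code units per prefix for the same text ("Ça" = U+00C7 U+0061).
// The counts differ (UTF-8 vs UTF-16 vs UTF-32), but all three literals
// should decode to the same sequence of code points.
#include <iostream>

int main() {
    std::cout << sizeof(u8"Ça") - 1 << '\n';                    // 3 UTF-8 bytes
    std::cout << sizeof(u"Ça") / sizeof(char16_t) - 1 << '\n';  // 2 UTF-16 units
    std::cout << sizeof(U"Ça") / sizeof(char32_t) - 1 << '\n';  // 2 UTF-32 units
}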

asked Jul 22 '11 by rubenvb



2 Answers

In GCC, use -finput-charset=charset:

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.

Also check out the options -fexec-charset and -fwide-exec-charset.
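
As a rough illustration (mine, not from the answer), dumping the bytes the compiler stored for a narrow literal shows what the execution character set produced; this assumes the file is saved as UTF-8 and compiled with something like g++ -finput-charset=UTF-8 -fexec-charset=UTF-8:

// Print the raw bytes of a narrow literal. With a UTF-8 execution charset,
// "ô" should come out as C3 B4 (the trailing 00 is the terminator);
// with a different -fexec-charset the bytes may differ.
#include <cstdio>

int main() {
    const char text[] = "ô";
    for (unsigned char byte : text) {
        std::printf("%02X ", byte);
    }
    std::printf("\n");
}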

Finally, about string literals:

char     a[] = "Hello";
wchar_t  b[] = L"Hello";
char16_t c[] = u"Hello";
char32_t d[] = U"Hello";

The encoding prefix of the string literal (L, u, U) merely determines the type of the literal.
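
A small sketch (mine, not part of the answer) that makes the type claim concrete with compile-time checks:

// Each prefix fixes the element type of the literal's array.
#include <type_traits>

static_assert(std::is_same<decltype("x"),   const char(&)[2]>::value,     "narrow");
static_assert(std::is_same<decltype(u8"x"), const char(&)[2]>::value,     "UTF-8 (plain char in C++11)");
static_assert(std::is_same<decltype(L"x"),  const wchar_t(&)[2]>::value,  "wide");
static_assert(std::is_same<decltype(u"x"),  const char16_t(&)[2]>::value, "UTF-16");
static_assert(std::is_same<decltype(U"x"),  const char32_t(&)[2]>::value, "UTF-32");

int main() {}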

answered by Kerrek SB

How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals. I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

From n3290, 2.2 Phases of translation [lex.phases]

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [Here's a bit about trigraphs.] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...]

The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

The basic source set is a simple list of allowed characters. It is not ASCII (see further). Anything not in this list is 'transformed' (conceptually at least) to a \uXXXX form.

So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic source character set plus a bunch of \uXXXX escapes. I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so your reasoning in terms of ASCII is flawed from the get-go.

Finally, the process I described happens to (non-raw) string literals too. That means your code is treated as if you had written:

std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
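
To check that the two spellings really are treated equivalently on a given implementation, here is a minimal sketch (mine, assuming the source file is actually read as UTF-8):

// The extended character and its universal-character-name spelling
// should yield identical arrays of UTF-8 code units.
#include <cassert>
#include <cstring>

int main() {
    static_assert(sizeof(u8"h\u00F4tel") == sizeof(u8"hôtel"), "same length");
    assert(std::strcmp(u8"h\u00F4tel", u8"hôtel") == 0);  // same bytes
}
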
answered by Luc Danton