 

How are u8-literals supposed to work?

Tags:

c++

c++11

I'm having trouble understanding the semantics of u8-literals, or rather, understanding the result on g++ 4.8.1

This is my expectation:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);

This is the result on g++ 4.8.1

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);
  • The source file is ISO-8859(-1)
  • We use these compiler flags: -m64 -std=c++11 -pthread -O3 -fpic

In my world, regardless of the encoding of the source file, the resulting utf8 string should be longer than 3.

Or, have I totally misunderstood the semantics of u8, and the use-case it targets? Please enlighten me.
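
For reference, a minimal sketch that dumps the bytes the compiler actually stored for the literal (prints each byte in hex, then the size):

#include <cstdio>
#include <string>

int main() {
  const std::string utf8 = u8"åäö";
  for (unsigned char c : utf8)           // dump whatever bytes the compiler stored
    std::printf("%02x ", c);
  std::printf("(size == %zu)\n", utf8.size());
}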

Update

If I explicitly tell the compiler what encoding the source file is, as many suggested, I get the expected behavior for u8 literals. But regular literals also get encoded to UTF-8.

That is:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
  • compiler invocation: g++ -m64 -std=c++11 -pthread -O3 -finput-charset=ISO8859-1
  • I tried a few other charset names defined by iconv, e.g. ISO_8859-1, and so on...

I'm even more confused now than before...

Fredrik asked May 05 '14


2 Answers

The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.

So you have several factors at play:

  1. which encoding is the source file written in (in your case, apparently ISO-8859). According to this encoding, the string literal is "åäö" (3 bytes, containing the values 0xc5, 0xe4, 0xf6)
  2. which encoding does the compiler assume when reading the source file? (I suspect that GCC defaults to UTF-8, but I could be wrong.)
  3. the encoding that the compiler uses for the generated string in the object file. You specify this to be UTF-8 via the u8 prefix.

Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859, then it will read the three characters, convert them to UTF-8, and write those, giving you a 6-byte string as a result (each of those characters encodes to 2 bytes in UTF-8).

However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.

You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use the \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
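
For example, a minimal sketch of the escape-sequence route, which sidesteps the source-encoding question entirely (assuming the C++11 meaning of u8, where the literal has type const char[]):

#include <cassert>
#include <string>

int main() {
  // Universal-character-names are encoding-independent: the compiler never
  // has to guess how the source file spells å, ä and ö.
  const std::string utf8 = u8"\u00E5\u00E4\u00F6";
  assert(utf8.size() == 6);  // each of the three characters takes 2 bytes in UTF-8
}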

Edit:

To clarify a bit, when you specify a string literal with the u8 prefix in your source code, then you are telling the compiler that "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process' environment, or perhaps just using a hardcoded default).

If the string in your source text contains the bytes 0xc5, 0xe4, 0xf6, and you tell it that the source text is encoded as ISO-8859, then the compiler will recognize that the string consists of the characters "åäö". It will see the u8 prefix, and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".

However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):

  • it might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence", and issue an error. This is what Clang does.
  • alternatively, it might say "ok, I have 3 bytes here, I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "ok, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.

Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.

But in both cases, note that the u8 prefix has nothing to do with your problem. That just tells the compiler to convert from "whatever encoding the string had when you read it, to UTF-8". But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).
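
You can reproduce that "garbled bytes passed through untouched" situation deliberately with hex escapes, which inject raw byte values into the literal with no conversion or validation at all (a sketch, just to illustrate the pass-through behaviour):

#include <cassert>
#include <string>

int main() {
  // The raw ISO-8859-1 byte values for å, ä, ö: not valid UTF-8, but the
  // compiler stores them as-is, so the "UTF-8" string is 3 bytes long.
  const std::string s = u8"\xE5\xE4\xF6";
  assert(s.size() == 3);
}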

The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.

The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) were introduced precisely to allow you to specify which encoding you wanted the compiler to write the output in. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
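
To make the prefixes concrete, here is a small sketch (C++11 semantics; note that in C++20 the element type of a u8 literal changes to char8_t):

// The prefix chooses the encoding (and element type) of the stored literal;
// \u escapes keep the example independent of the source file's encoding.
const char     a8[]  = u8"\u00E5";  // UTF-8:  0xC3 0xA5 0x00
const char16_t a16[] = u"\u00E5";   // UTF-16: 0x00E5 0x0000
const char32_t a32[] = U"\u00E5";   // UTF-32: 0x000000E5 0x00000000

static_assert(sizeof(a8)  == 3, "two UTF-8 bytes plus the terminator");
static_assert(sizeof(a16) == 2 * sizeof(char16_t), "one UTF-16 code unit plus the terminator");
static_assert(sizeof(a32) == 2 * sizeof(char32_t), "one UTF-32 code unit plus the terminator");

int main() {}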

jalf answered Oct 13 '22


In order to illustrate this discussion, here are some examples. Let's consider the code:

#include <iostream>

int main() {
  std::cout << "åäö\n";
}

1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

In other words, two bytes per "grapheme cluster" (Unicode jargon; in this case, per character), plus the final newline (0a). This is because my file is encoded in utf-8, the input-charset is assumed to be utf-8 by cpp, and the exec-charset is utf-8 by default in gcc (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.

2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:

% ./a.out | od -txC
0000000 e5 e4 f6 0a

i.e. the three characters are now encoded using iso-8859-1. I am not sure about the magic going on here: this time cpp seems to have correctly guessed that the file was iso-8859-1 (without any hint) and converted it to utf-8 internally (according to the link above), yet the compiler still stored the iso-8859-1 string in the binary. We can check this by looking at the .rodata section of the binary:

% objdump -s -j .rodata a.out

a.out:     file format elf64-x86-64

Contents of section .rodata:
400870 01000200 00e5e4f6 0a00               ..........

(Note the "e5e4f6" sequence of bytes).
This makes perfect sense as a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in his program's output.

3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec-charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:

% ./a.out | od -txC
0000000 e5 e4 f6 0a

It is not clear to me how these two options interact...
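
One way to see the division of labour (a sketch that uses a \u escape, so that -finput-charset drops out of the picture) is a compile-time probe of the execution charset:

// A sketch: this compiles with -fexec-charset=iso-8859-1, where the literal
// is the single byte 0xE5 plus the terminator; with the default utf-8 exec
// charset the literal is 0xC3 0xA5 plus the terminator and the assertion fails.
static_assert(sizeof("\u00E5") == 2, "execution charset stores å as a single byte");

int main() {}

In other words, the two flags appear not to interact at all: -finput-charset only governs how the source bytes are decoded into characters, while -fexec-charset governs how those characters are re-encoded into the bytes stored in the binary.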

4) Now let's add the "u8" prefix into the mix:

#include <iostream>

int main() {
  std::cout << u8"åäö\n";
}

If the file is utf-8-encoded, compiling with the default charsets (g++ -std=c++11 encoding.cpp) unsurprisingly produces utf-8 output as well. If I request the compiler to use iso-8859-1 as the execution charset instead (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1, and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:

% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a

So it seems that"u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".

v.p answered Oct 13 '22