Having trouble understanding the semantics of u8 literals, or rather, understanding the result on g++ 4.8.1.
This is my expectation:
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
This is the result on g++ 4.8.1
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);
In my world, regardless of the encoding of the source file, the resulting utf8 string should be longer than 3.
Or, have I totally misunderstood the semantics of u8, and the use-case it targets? Please enlighten me.
Update
If I explicitly tell the compiler what encoding the source file is in, as many suggested, I get the expected behavior for u8 literals. But regular literals also get encoded to utf8.
That is:
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
I'm even more confused now than before...
For standard C++, the u8 prefix produces a char-based literal.
A character literal is composed of a constant character. It is represented by the character surrounded by single quotation marks. There are five kinds of character literals; ordinary character literals have type char, for example 'a'.
In string and character sequences, when you want the backslash to represent itself (rather than the beginning of an escape sequence), you must use a \\ backslash escape sequence.
A string literal is a sequence of zero or more characters enclosed within double quotation marks. The following are examples of string literals: "Hello, world!" "He said, \"Take it or leave it.\""
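To make that concrete, here is a minimal sketch (my own, assuming C++11/14/17; it is not from the question or the answers) showing that a u8 literal is char-based and can therefore initialize a std::string directly. Note that C++20 changed the element type of u8 literals to char8_t, so the assertion below would no longer hold there.

#include <string>
#include <type_traits>

// In C++11/14/17, u8"..." has type const char[N], which decays to const char*.
static_assert(std::is_same<std::decay<decltype(u8"abc")>::type,
                           const char*>::value,
              "u8 string literals are char-based before C++20");

int main() {
    std::string s = u8"abc"; // fine: the element type is plain char
    return 0;
}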
The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.
So you have several factors at play:
1) which encoding the source file is actually saved in,
2) which encoding the compiler assumes the source file is in, and
3) the u8 prefix.
Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859, then it will read the three characters, convert them to UTF-8, and write those, giving you a 6-byte string as a result (each of those characters encodes to 2 bytes in UTF-8).
However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.
You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use the \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
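As a side note, here is a minimal sketch (my own, not part of the original answer) of the escape-sequence approach: universal character names spell the code points out in plain ASCII, so the result no longer depends on how the compiler decodes the source file.

#include <cassert>
#include <string>

int main() {
    // "åäö" written as universal character names; the source stays pure ASCII
    const std::string utf8 = u8"\u00E5\u00E4\u00F6";
    assert(utf8.size() == 6); // each of these code points takes 2 bytes in UTF-8
    return 0;
}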
To clarify a bit: when you specify a string literal with the u8 prefix in your source code, you are telling the compiler "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process's environment, or perhaps just using a hardcoded default).
If the string in your source text contains the bytes 0xc5, 0xe4, 0xf6, and you tell the compiler that "the source text is encoded as ISO-8859", then the compiler will recognize that the string consists of the characters "åäö". It will see the u8 prefix and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):
1) it may reject the string literal, because the byte sequence 0xc5, 0xe4, 0xf6 is not valid UTF-8, or
2) it may say "the u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.
Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.
But in both cases, note that the u8 prefix has nothing to do with your problem. That just tells the compiler to convert from "whatever encoding the string had when you read it" to UTF-8. But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).
The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.
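To see which case you ended up in, here is a small sketch (my own addition, not from the answer) that dumps the bytes the compiler actually stored for the literal: c3 a5 c3 a4 c3 b6 means the source encoding was read correctly, while e5 e4 f6 means the ISO-8859-1 bytes were passed through untouched.

#include <cstdio>
#include <string>

// Print each byte of the string as two hex digits.
static void dump_bytes(const std::string& s) {
    for (unsigned char c : s)
        std::printf("%02x ", c);
    std::printf("\n");
}

int main() {
    dump_bytes(u8"åäö");
    return 0;
}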
The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) were introduced precisely to allow you to specify which encoding you wanted the compiler to write the output in. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
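For reference, a short sketch (my own, assuming C++11) of the encoding prefixes mentioned above and the element types they produce:

// No prefix: execution character set, whatever the compiler chooses.
const char     narrow_lit[] =   "åäö";
// u8: UTF-8 (the element type becomes char8_t in C++20).
const char     utf8_lit[]   = u8"åäö";
// u: UTF-16.
const char16_t utf16_lit[]  =  u"åäö";
// U: UTF-32.
const char32_t utf32_lit[]  =  U"åäö";

int main() { return 0; }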
In order to illustrate this discussion, here are some examples. Let's consider the code:
#include <iostream>

int main() {
    std::cout << "åäö\n";
}
1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
In other words, two bytes per "grapheme cluster" (according to unicode jargon, i.e. in this case, per character), plus the final newline (0a). This is because my file is encoded in utf-8, the input-charset is assumed to be utf-8 by cpp, and the exec-charset is utf-8 by default in gcc (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.
2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
i.e. the three characters are now encoded using iso-8859-1. I am not sure about the magic going on here, as this time it seems that cpp correctly guessed that the file was iso-8859-1 (without any hint), converted it to utf-8 internally (according to the link above) but the compiler still stored the iso-8859-1 string in the binary. This we can check by looking at the .rodata section of the binary:
% objdump -s -j .rodata a.out
a.out: file format elf64-x86-64
Contents of section .rodata:
400870 01000200 00e5e4f6 0a00 ..........
(Note the "e5e4f6" sequence of bytes).
This makes perfect sense as a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in his program's output.
3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec-charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
It is not clear to me how these two options interact...
4) Now let's add the "u8" prefix into the mix:
#include <iostream>

int main() {
    std::cout << u8"åäö\n";
}
If the file is utf-8-encoded, then unsurprisingly, compiling with the default charsets (g++ -std=c++11 encoding.cpp) produces utf-8 output as well. If I request the compiler to use iso-8859-1 internally instead (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1 and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it seems that "u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".
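To tie this back to the original question, here is a concluding sketch (my own, not from the answers): if this file is saved as iso-8859-1 and compiled with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, only the u8 literal is forced to utf-8, so the two sizes differ.

#include <iostream>
#include <string>

int main() {
    const std::string plain =   "åäö"; // execution character set (iso-8859-1 here): 3 bytes
    const std::string utf8  = u8"åäö"; // always UTF-8: 6 bytes
    std::cout << plain.size() << " " << utf8.size() << "\n"; // prints "3 6"
    return 0;
}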