
C++ utf-8 literals in GCC and MSVC

Here I have some simple code:

    #include <iostream>
    #include <cstdint>

    int main()
    {
        const unsigned char utf8_string[] = u8"\xA0";
        std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
        for (int i = 0; i < sizeof(utf8_string); i++) {
            std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
        }
    }

I see different behavior here between MSVC and GCC. MSVC treats "\xA0" as an unencoded Unicode code point and encodes it to UTF-8, so in MSVC the output is:

C2A0

which is the correct UTF-8 encoding of the Unicode code point U+00A0.

But with GCC nothing happens; it treats the string as plain bytes. The output does not change even if I remove the u8 prefix from the literal.

Both compilers encode to UTF-8, with the output C2A0, if the literal is written as u8"\u00A0".
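
For reference, this is a minimal sketch of that variant, using the same code as above with the universal-character-name in place of the hex escape:

    #include <iostream>
    #include <cstdint>

    int main()
    {
        // Same test as above, but with \u00A0 instead of the \xA0 hex escape;
        // both compilers print c2 a0 here (plus the trailing 0 for the terminator).
        const unsigned char utf8_string[] = u8"\u00A0";
        std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
        for (int i = 0; i < sizeof(utf8_string); i++) {
            std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
        }
    }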

Why do the compilers behave differently, and which one actually gets it right?

Software used for the test:

  • GCC 8.3.0
  • MSVC 19.00.23506
  • C++11

asked Apr 29 '20 by toozyfuzzy


3 Answers

They're both wrong.

As far as I can tell, the C++17 standard says (in [lex.string]) that:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.

There are tickets for this which are currently marked as Under Investigation:

  • https://developercommunity.visualstudio.com/content/problem/225847/hex-escape-codes-in-a-utf8-literal-are-treated-in.html
  • https://developercommunity.visualstudio.com/content/problem/260684/escape-sequences-in-unicode-string-literals-are-ov.html

However, it also says about UTF-8 literals that:

If the value is not representable with a single UTF-8 code unit, the program is ill-formed.

Since 0xA0 is not a valid single UTF-8 code unit, the program should not compile.

Note that:

  • UTF-8 literals starting with u8 are defined as being narrow.
  • \xA0 is an escape sequence.
  • \u00A0 is considered a universal-character-name, not an escape sequence.
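
A minimal sketch of the size consequence of that last distinction, assuming ordinary narrow literals and a UTF-8 execution character set (GCC's default); the values in the comments follow from the wording quoted above:

    #include <iostream>

    int main()
    {
        // Per the quoted wording, an escape sequence contributes exactly one
        // code unit, while a universal-character-name is multibyte-encoded
        // (two bytes, C2 A0, with a UTF-8 execution character set).
        std::cout << sizeof("\xA0") << '\n'     // 2: one byte plus the terminating '\0'
                  << sizeof("\u00A0") << '\n';  // 3: C2 A0 plus the terminating '\0'
    }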
answered by Etienne Laurin


This is CWG issue 1656.

It has been resolved in the current standard draft through P2029R4: numeric escape sequences are taken by their value as a single code unit, not as a Unicode code point that is then encoded to UTF-8, even if the result is an invalid UTF-8 sequence.

Therefore GCC's behavior is/will be correct.
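
A minimal sketch of the resolved behaviour, which matches what GCC already does (compiled as C++17 or earlier, where a u8 literal can still initialize an array of unsigned char):

    #include <cstdint>
    #include <iostream>

    int main()
    {
        // The numeric escape \xA0 is taken by value as a single code unit,
        // even though 0xA0 on its own is not a valid UTF-8 sequence.
        const unsigned char s[] = u8"\xA0";
        std::cout << std::hex
                  << sizeof(s) << '\n'                          // 2: 0xA0 plus '\0'
                  << static_cast<std::uint16_t>(s[0]) << '\n';  // a0
    }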

answered by user17732522


I can't tell you which way is true to the standard.

The way MSVC does it is at least logically consistent and easy to explain. The three escape sequences \x, \u, and \U behave identically except for the number of hex digits they pull from the input: 2, 4, or 8. Each defines a Unicode code point that must then be encoded to UTF-8. Embedding a byte without encoding it opens up the possibility of creating an invalid UTF-8 sequence.
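
A sketch of that model, assuming C++17 or earlier: under MSVC's interpretation described above, all three spellings name the code point U+00A0 and come out as the two UTF-8 bytes C2 A0; a compiler following the resolution cited in the previous answer instead leaves the first one as the single byte 0xA0.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    // Print the code units of a literal, skipping the terminating '\0'.
    void dump(const unsigned char* s, std::size_t n)
    {
        for (std::size_t i = 0; i + 1 < n; ++i)
            std::cout << std::hex << static_cast<std::uint16_t>(s[i]) << ' ';
        std::cout << '\n';
    }

    int main()
    {
        const unsigned char a[] = u8"\xA0";        // MSVC: c2 a0, GCC: a0
        const unsigned char b[] = u8"\u00A0";      // c2 a0 on both
        const unsigned char c[] = u8"\U000000A0";  // c2 a0 on both
        dump(a, sizeof(a));
        dump(b, sizeof(b));
        dump(c, sizeof(c));
    }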

answered by Mark Ransom