
C++ utf-8 literals in GCC and MSVC

Here I have some simple code:

    #include <iostream>
    #include <cstdint>

    int main()
    {
        const unsigned char utf8_string[] = u8"\xA0";
        std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
        for (int i = 0; i < sizeof(utf8_string); i++) {
            std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
        }
    }

I see different behavior here between MSVC and GCC. MSVC treats "\xA0" as an unencoded Unicode code point and encodes it to UTF-8, so in MSVC the output is:

C2A0

which is the correct UTF-8 encoding of the Unicode code point U+00A0.

But with GCC nothing happens; it treats the string as plain bytes. The output does not change even if I remove the u8 prefix from the literal.

Both compilers encode to UTF-8, with the output C2A0, if the literal is written as u8"\u00A0".
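
For reference, this is a minimal sketch of that variant, using the same code as above with the universal-character-name in place of the hex escape:

    #include <iostream>
    #include <cstdint>

    int main()
    {
        // Same test as above, but with \u00A0 instead of the \xA0 hex escape;
        // both compilers print c2 a0 here (plus the trailing 0 for the terminator).
        const unsigned char utf8_string[] = u8"\u00A0";
        std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
        for (int i = 0; i < sizeof(utf8_string); i++) {
            std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
        }
    }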

Why do the compilers behave differently, and which one actually gets it right?

Software used for the test:

  • GCC 8.3.0
  • MSVC 19.00.23506
  • C++11

asked Apr 29 '20 by toozyfuzzy


3 Answers

They're both wrong.

As far as I can tell, the C++17 standard says (in [lex.string]) that:

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.

Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.

There are tickets for this which are currently marked as Under Investigation:

  • https://developercommunity.visualstudio.com/content/problem/225847/hex-escape-codes-in-a-utf8-literal-are-treated-in.html
  • https://developercommunity.visualstudio.com/content/problem/260684/escape-sequences-in-unicode-string-literals-are-ov.html

However, it also says about UTF-8 literals that:

If the value is not representable with a single UTF-8 code unit, the program is ill-formed.

Since 0xA0 is not a valid single UTF-8 code unit, the program should not compile.

Note that:

  • UTF-8 literals starting with u8 are defined as being narrow.
  • \xA0 is an escape sequence.
  • \u00A0 is considered a universal-character-name, not an escape sequence.
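
A minimal sketch of the size consequence of that last distinction, assuming ordinary narrow literals and a UTF-8 execution character set (GCC's default); the values in the comments follow from the wording quoted above:

    #include <iostream>

    int main()
    {
        // Per the quoted wording, an escape sequence contributes exactly one
        // code unit, while a universal-character-name is multibyte-encoded
        // (two bytes, C2 A0, with a UTF-8 execution character set).
        std::cout << sizeof("\xA0") << '\n'     // 2: one byte plus the terminating '\0'
                  << sizeof("\u00A0") << '\n';  // 3: C2 A0 plus the terminating '\0'
    }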
answered by Etienne Laurin


This is CWG issue 1656.

It has been resolved in the current standard draft through P2029R4: numeric escape sequences are taken by their value as a single code unit, not as a Unicode code point that is then encoded to UTF-8, even if the result is an invalid UTF-8 sequence.

Therefore GCC's behavior is/will be correct.
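
A minimal sketch of the resolved behaviour, which matches what GCC already does (compiled as C++17 or earlier, where a u8 literal can still initialize an array of unsigned char):

    #include <cstdint>
    #include <iostream>

    int main()
    {
        // The numeric escape \xA0 is taken by value as a single code unit,
        // even though 0xA0 on its own is not a valid UTF-8 sequence.
        const unsigned char s[] = u8"\xA0";
        std::cout << std::hex
                  << sizeof(s) << '\n'                          // 2: 0xA0 plus '\0'
                  << static_cast<std::uint16_t>(s[0]) << '\n';  // a0
    }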

answered by user17732522


I can't tell you which way is true to the standard.

The way MSVC does it is at least logically consistent and easy to explain. The three escape sequences \x, \u, and \U behave identically except for the number of hex digits they pull from the input: 2, 4, or 8. Each defines a Unicode code point that must then be encoded to UTF-8. Embedding a byte without encoding it opens up the possibility of creating an invalid UTF-8 sequence.
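
A sketch of that model, assuming C++17 or earlier: under MSVC's interpretation described above, all three spellings name the code point U+00A0 and come out as the two UTF-8 bytes C2 A0; a compiler following the resolution cited in the previous answer instead leaves the first one as the single byte 0xA0.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    // Print the code units of a literal, skipping the terminating '\0'.
    void dump(const unsigned char* s, std::size_t n)
    {
        for (std::size_t i = 0; i + 1 < n; ++i)
            std::cout << std::hex << static_cast<std::uint16_t>(s[i]) << ' ';
        std::cout << '\n';
    }

    int main()
    {
        const unsigned char a[] = u8"\xA0";        // MSVC: c2 a0, GCC: a0
        const unsigned char b[] = u8"\u00A0";      // c2 a0 on both
        const unsigned char c[] = u8"\U000000A0";  // c2 a0 on both
        dump(a, sizeof(a));
        dump(b, sizeof(b));
        dump(c, sizeof(c));
    }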

answered by Mark Ransom