I recently realized that the u8 character-literal prefix in C++17 is not meant for all UTF-8 code points, just for the ASCII part.
From cppreference:
UTF-8 character literal, e.g. u8'a'. Such literal has type char and the value equal to ISO 10646 code point value of c-char, provided that the code point value is representable with a single UTF-8 code unit. If c-char is not in Basic Latin or C0 Controls Unicode block, the program is ill-formed.
auto hello = u8'嗨'; // ill-formed
auto world = u8"世"; // not a character
auto what = 0xE7958C; // almost human-readable
auto wrong = u8"錯"[0]; // not even correct
How do I get a code point literal in utf8 succinctly?
EDIT: For the people wondering how a utf8 code point may be stored, a way I find reasonable is like the way Golang does it. The basic idea is to store a single code point in a 32-bit type when only a single code point is required.
EDIT2: From the arguments put forward in the helpful comments, there is no reason to keep encoded UTF-8 in a 32-bit type. Either have it decoded, which would be UTF-32 with the prefix U, or have it encoded in a string, with the prefix u8.
If you want a codepoint, then you should use char32_t and U for the prefix:
auto hello = U'嗨';
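As a minimal sketch of what the U prefix gives you: the literal is a char32_t whose value is the code point itself, with no encoding applied. The names hello_cp and code_point_value are made up for this illustration, and the character is written as the universal-character-name \u55E8 (嗨) so the result does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstdint>

// U'...' yields a char32_t holding the code point value directly.
// \u55E8 is the universal-character-name for U+55E8 (嗨).
constexpr char32_t hello_cp = U'\u55E8';

static_assert(hello_cp == 0x55E8, "a char32_t literal stores the code point value");

// Hypothetical helper: exposes the numeric value of a code point.
constexpr std::uint32_t code_point_value(char32_t c) {
    return static_cast<std::uint32_t>(c);
}
```

Comparing code_point_value(hello_cp) against 0x55E8 confirms that nothing UTF-8 specific is stored in the variable.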
UTF-8 stores codepoints as a sequence of 8-bit code units. A char in C++ is a code unit, and therefore it cannot store an entire Unicode codepoint. The u8 prefix on character literals doesn't compile if you provide a codepoint that requires multiple code units to store, since a character literal only yields a single char.
If you want a single Unicode codepoint, encoded in UTF-8, then what you want is a string literal, not a character literal:
auto hello = u8"嗨";
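To make the "sequence of code units" point concrete, here is a small sketch that inspects the array behind such a string literal. The names hello_utf8, hello_units and code_unit_at are invented for this example. The element type is char in C++17 and char8_t in C++20, so the code only relies on each element being one byte wide.

```cpp
#include <cassert>
#include <cstddef>

// A u8 string literal is an array of UTF-8 code units plus a terminating null.
// \u55E8 is the universal-character-name for U+55E8 (嗨).
constexpr auto& hello_utf8 = u8"\u55E8";

// Number of code units, excluding the terminating null.
constexpr std::size_t hello_units = sizeof(hello_utf8) - 1;

static_assert(hello_units == 3, "U+55E8 needs three UTF-8 code units");

// Hypothetical helper: returns the i-th code unit as an unsigned value.
inline unsigned code_unit_at(std::size_t i) {
    assert(i < hello_units);
    return static_cast<unsigned char>(hello_utf8[i]);
}
```

The three code units are 0xE5, 0x97 and 0xA8, which is why the single code point cannot fit into one char.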
a way I find reasonable is like the way Golang does it.
Well, you're not using Go, are you?
In C++, if you ask for a character literal, then you mean a single object of that literal's character type. A u8 literal will always be a char. Its type will not vary based on what is in the literal. You asked for a character literal, you get a character literal.
From the website you linked to, it is clear that Go doesn't actually have the concept of a UTF-8 character literal at all. It simply has character literals, all of which are 32-bit values. In effect, all character literals in Go behave like U''.
In C++, a character literal is exactly one character object. A character object in C++ terminology corresponds to a code unit in Unicode. Some code points require more than one code unit when encoded in UTF-8. Therefore not all code points can be represented by a single character object. The code points that can be represented are those in the Basic Latin and C0 Controls blocks.
To represent an arbitrary code point in UTF-8, you need an array of code units, i.e. a string. There is an analogous prefix for string literals: u8"☺".
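If the code point is only known at run time, the same idea (one code point becomes one to four code units) can be sketched by hand. The function encode_utf8 below is a hypothetical helper written for this answer, not a standard library facility; it assumes the input is at most U+10FFFF and does not reject surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one code point as 1 to 4 UTF-8 code units.
// Assumption: cp <= 0x10FFFF; surrogate values are not rejected.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // Basic Latin and C0 Controls: a single code unit.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(U'a') yields a one-byte string, while encode_utf8(U'\u55E8') yields three bytes. Only the first case could ever have been written as a u8 character literal.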