
How do I get a code point literal in utf8

Tags: c++, utf-8, c++17

I recently realized that the u8 character prefix in C++17 is not meant for all UTF-8 code points, only for the ASCII part.

From cppreference:

UTF-8 character literal, e.g. u8'a'. Such literal has type char and the value equal to ISO 10646 code point value of c-char, provided that the code point value is representable with a single UTF-8 code unit. If c-char is not in Basic Latin or C0 Controls Unicode block, the program is ill-formed.

auto hello = u8'嗨';     // ill-formed
auto world = u8"世";     // not a character
auto what = 0xE7958C;    // almost human-readable
auto wrong = u8"錯"[0];  // not even correct

How do I get a code point literal in utf8 succinctly?

EDIT: For those wondering how a UTF-8 code point might be stored, one way I find reasonable is the way Go does it. The basic idea is to store a single code point in a 32-bit type when only one code point is needed.

EDIT2: Based on the arguments in the helpful comments, there is no reason to keep encoded UTF-8 in a 32-bit type. Either decode it, which gives UTF-32 with the prefix U, or keep it encoded in a string with the prefix u8.

Passer By asked Jul 31 '17 15:07


2 Answers

If you want a codepoint, then you should use char32_t and U for the prefix:

auto hello = U'嗨';

UTF-8 stores each codepoint as a sequence of one to four 8-bit code units. A char in C++ is a single code unit, so it cannot store every Unicode codepoint, only those that fit in one code unit. The u8 prefix on character literals doesn't compile if you provide a codepoint that requires multiple code units, since a character literal only yields a single char.
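As a minimal sketch of the distinction above (assuming a C++17 or later compiler; fits_in_one_code_unit is a hypothetical helper, not a standard function), the following checks that a u8 character literal is one code unit while a char32_t literal holds a whole codepoint. The character 嗨 is spelled as a universal character name so that the example does not depend on the source file's encoding.

```cpp
#include <climits>

// Hypothetical helper: true when a code point encodes to a single UTF-8
// code unit, i.e. when it lies in the range U+0000..U+007F.
constexpr bool fits_in_one_code_unit(char32_t cp) {
    return cp < 0x80;
}

// A u8 character literal is exactly one code unit
// (type char in C++17, char8_t from C++20).
static_assert(sizeof(u8'a') == 1, "u8'a' is a single code unit");
static_assert(u8'a' == 0x61, "its value is the code point of 'a'");

// A char32_t literal is wide enough for any code point.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t has at least 32 bits");
static_assert(U'\u55E8' == 0x55E8, "the literal holds the code point itself");

// U+55E8 is outside the single-code-unit range, which is why the
// corresponding u8 character literal is rejected.
static_assert(fits_in_one_code_unit(U'a'), "ASCII needs one code unit");
static_assert(!fits_in_one_code_unit(U'\u55E8'), "U+55E8 needs more than one");
```

All checks here are static_assert, so the file compiling is the whole test.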

If you want a single Unicode codepoint, encoded in UTF8, then what you want is a string literal, not a character literal:

auto hello = u8"嗨";
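To make the string form concrete, here is a small sketch (assuming C++17 or later; code_unit_count and code_unit_at are hypothetical helpers introduced only for illustration). It shows that the literal for 嗨 is an array of three code units plus a terminating null, which is also why indexing such a literal, as in the question's u8"錯"[0], yields only the lead byte.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// Templated on CharT so it accepts const char[] (C++17) as well as
// const char8_t[] (C++20).
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// Code unit at index i, returned as an unsigned value for easy comparison.
template <typename CharT, std::size_t N>
constexpr unsigned code_unit_at(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// U+55E8 (the character in u8"嗨") encodes to the three bytes E5 97 A8.
static_assert(code_unit_count(u8"\u55E8") == 3, "three code units");
static_assert(code_unit_at(u8"\u55E8", 0) == 0xE5, "lead byte");
static_assert(code_unit_at(u8"\u55E8", 1) == 0x97, "first continuation byte");
static_assert(code_unit_at(u8"\u55E8", 2) == 0xA8, "second continuation byte");
```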

"a way I find reasonable is like the way Golang does it."

Well, you're not using Go, are you?

In C++, if you ask for a character literal, then you mean a single object of that literal's type. In C++17, a u8 character literal is always a char. Its type does not vary based on what is in the literal. You asked for a character literal, you get a character literal.

From the website you linked to, it is clear that Go doesn't actually have the concept of a UTF-8 character literal at all. It simply has character literals, all of which are 32-bit values. In effect, all character literals in Go behave like U''.

Nicol Bolas answered Sep 18 '22 23:09


In C++, a character literal is exactly one character object. A character object in C++ terminology corresponds to a code unit in Unicode. Some code points require more than one UTF-8 code unit, so not all code points can be represented by a single character object. The code points that are representable are those in the Basic Latin and C0 Controls blocks (U+0000 to U+007F).

To represent an arbitrary code point in UTF-8, you need an array of code units, i.e. a string. There is an analogous prefix for string literals: u8"☺".
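For code points that are only known at run time, the same idea can be sketched as a small encoder. This is an illustrative helper rather than a standard library facility: encode_utf8 is a hypothetical name, and returning an empty string for surrogates and out-of-range values is an assumption made for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as a string of
// UTF-8 code units. Returns an empty string for surrogates (U+D800..U+DFFF)
// and for values above U+10FFFF (an assumption of this sketch).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // One code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not valid code points to encode
        }
        // Three code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    // No code point needs more than four code units.
    assert(out.size() <= 4);
    return out;
}
```

With this sketch, encode_utf8(U'☺') produces the same three code units that the literal u8"☺" contains. For values known at compile time, the string literal remains the simpler choice.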

eerorika answered Sep 21 '22 23:09
