
Is C++20 'char8_t' the same as our old 'char'?

Tags: c++, c++20, c++14

In the cppreference documentation, I noticed this for char:

The character types are large enough to represent any UTF-8 eight-bit code unit (since C++14)

and this for char8_t:

type for UTF-8 character representation, required to be large enough to represent any UTF-8 code unit (8 bits)

Does that mean both are the same type? Or does char8_t have some other feature?

Pavan Chandaka asked Aug 07 '19 21:08


1 Answer

Disclaimer: I'm the author of the char8_t P0482 and P1423 proposals.

In C++20, char8_t is a distinct type from all other types. In the related proposal for C, N2653, char8_t is a typedef of unsigned char similar to the existing typedefs for char16_t and char32_t.

In C++20, char8_t has an underlying representation that matches unsigned char. It therefore has the same size (at least 8-bit, but may be larger), alignment, and integer conversion rank as unsigned char, but has different aliasing rules.

In particular, char8_t was not added to the list of types at [basic.lval]p11, [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. In other words:

    reinterpret_cast<const char   *>(u8"text"); // Ok.
    reinterpret_cast<const char8_t*>("text");   // Undefined behavior.

The motivation for a distinct type with these properties is:

  1. To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.

  2. To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).

  3. To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation defined).

  4. To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.

Tom Honermann answered Sep 17 '22 18:09