Are `char16_t` and `char32_t` misnomers?

NB: I'm sure someone will call this subjective, but I reckon it's fairly tangible.

C++11 gives us new basic_string types std::u16string and std::u32string, type aliases for std::basic_string<char16_t> and std::basic_string<char32_t>, respectively.

To me, the substrings "u16" and "u32" in this context rather imply "UTF-16" and "UTF-32", which would be silly since C++ of course has no concept of text encodings.

The names in fact reflect the character types char16_t and char32_t, but these seem misnamed. They are unsigned, due to the unsignedness of their underlying types:

[C++11: 3.9.1/5]: [..] Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively [..]

But then it seems to me that these names violate the convention that such unsigned types have names beginning with 'u', and that the use of a number like 16, unqualified by a term like least, indicates a fixed-width type.
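For concreteness, here's a minimal sketch (assuming a C++11 compiler) that checks the properties quoted above; the assertions are mine, just for illustration:

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// char16_t and char32_t are unsigned...
static_assert(std::is_unsigned<char16_t>::value, "char16_t is unsigned");
static_assert(std::is_unsigned<char32_t>::value, "char32_t is unsigned");

// ...with the same size and alignment as uint_least16_t / uint_least32_t,
// i.e. only "at least" 16/32 bits wide, not exactly 16/32 bits.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "");

// Yet they are distinct types, not typedefs for those integer types.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "");

// And u16string/u32string are just basic_string over those types.
static_assert(std::is_same<std::u16string,
                           std::basic_string<char16_t>>::value, "");
```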

My question, then, is this: am I imagining things, or are these names fundamentally flawed?

asked Oct 08 '12 by Lightness Races in Orbit


2 Answers

The naming convention to which you refer (uint32_t, int_fast32_t, etc.) is actually only used for typedefs, not for fundamental types. The fundamental integer types are the combinations {signed, unsigned} × {char, short, int, long, long long}, as opposed to the floating-point types.

However, in addition to those integer types there are four distinct, fundamental types: char, wchar_t, char16_t and char32_t, which are the types of the respective literals '', L'', u'' and U'', and which are used for character data (and similarly for arrays of those). Those types are of course also integer types, and thus they have the same layout as some of the arithmetic integer types, but the language makes a very clear distinction between the former, arithmetic types (which you would use for computations) and the latter "character" types, which form the basic unit of some kind of I/O data.
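To make that distinctness concrete, here is a small sketch; the overload set which() is hypothetical, purely for illustration. Each kind of character literal selects a different overload, even on implementations where uint_least16_t is the same type as unsigned short:

```cpp
#include <iostream>

// Illustrative overload set: one overload per character type, plus one
// ordinary arithmetic type for comparison.
void which(char)           { std::cout << "char\n"; }
void which(wchar_t)        { std::cout << "wchar_t\n"; }
void which(char16_t)       { std::cout << "char16_t\n"; }
void which(char32_t)       { std::cout << "char32_t\n"; }
void which(unsigned short) { std::cout << "unsigned short\n"; }

int main() {
    which('a');   // char
    which(L'a');  // wchar_t
    which(u'a');  // char16_t, never "unsigned short"
    which(U'a');  // char32_t
}
```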

(I've previously rambled about those new types here and here.)

So, I think that char16_t and char32_t are actually very aptly named to reflect the fact that they belong to the "char" family of integer types.

answered by Kerrek SB


are these names fundamentally flawed?

(I think most of this question has been answered in the comments, but to make an answer:) No, not at all. char16_t and char32_t were created for a specific purpose: to provide data-type support for all the Unicode encoding forms (UTF-8 is covered by char) while keeping them generic enough that they are not limited to Unicode only. Whether they are unsigned or fixed-width is not directly related to what they are: character data types, i.e. types which hold and represent characters. Signedness is a property of data types that represent numbers, not characters. The types are meant to store 16-bit or 32-bit character data, nothing more and nothing less.
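As a small illustration (a sketch assuming a C++11 compiler; the u"" and U"" literals happen to yield UTF-16 and UTF-32 code units, but the types themselves impose no encoding):

```cpp
#include <string>

int main() {
    // One 16-bit code unit and one 32-bit code unit, respectively.
    std::u16string s16 = u"\u00E9";      // é as a single UTF-16 code unit
    std::u32string s32 = U"\U0001F34C";  // a single UTF-32 code unit

    // Nothing stops you from storing code units of some other 16-bit
    // encoding; char16_t is just "16-bit character data".
    char16_t other_unit = 0xFEFF;
    (void)other_unit;

    return (s16.size() == 1 && s32.size() == 1) ? 0 : 1;
}
```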

answered by Jesse Good