 

char vs wchar_t vs char16_t vs char32_t (c++11)

Tags:

c++

c++11

From what I understand, a char is safe to house ASCII characters, whereas char16_t and char32_t are safe to house characters from Unicode, one for the 16-bit variety and another for the 32-bit variety (Should I have said "a" instead of "the"?). But I'm then left wondering what the purpose behind wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size was never guaranteed to be bigger than a char? Clarification would be nice!

asked Sep 28 '13 by user904963

People also ask

What is the difference between wchar_t and char?

C++ has no built-in byte type, so unsigned char is often used to represent a byte. The wchar_t type, by contrast, is an implementation-defined wide character type.

What is char16_t?

char16_t is a character type for 16-bit code units. It has the same size, signedness, and alignment as uint_least16_t (the smallest unsigned integer type with a width of at least 16 bits), but it is a distinct type.

What is wchar_t data type?

Just as char is the type for ordinary character constants, wchar_t is the type for wide character constants. It occupies 2 or 4 bytes depending on the implementation, and is most often used for text in languages, such as Japanese, whose characters do not fit in a single byte.

What is the size of wchar_t?

There is no single default: the size of wchar_t is implementation-defined. It is typically 4 bytes on Linux and macOS and 2 bytes on Windows.
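To check what your own implementation uses, a tiny sketch (the printed values vary by platform):

    #include <cstdio>

    int main() {
        // Implementation-defined: typically 4 on Linux/macOS, 2 on Windows.
        std::printf("sizeof(wchar_t)  = %zu\n", sizeof(wchar_t));
        // Fixed meaning since C++11: at least 16 and 32 bits wide, respectively.
        std::printf("sizeof(char16_t) = %zu\n", sizeof(char16_t));
        std::printf("sizeof(char32_t) = %zu\n", sizeof(char32_t));
    }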


2 Answers

char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
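For a concrete picture of those code units, here is a minimal C++11 sketch: the é below is U+00E9, which needs two UTF-8 code units but only one UTF-16 or UTF-32 code unit.

    // C++11 literal prefixes select the code unit type.
    const char     utf8[]  = u8"caf\u00e9"; // UTF-8:  8-bit code units
    const char16_t utf16[] = u"caf\u00e9";  // UTF-16: 16-bit code units
    const char32_t utf32[] = U"caf\u00e9";  // UTF-32: 32-bit code units

    static_assert(sizeof(utf8)  / sizeof(char)     == 6, "5 code units + NUL");
    static_assert(sizeof(utf16) / sizeof(char16_t) == 5, "4 code units + NUL");
    static_assert(sizeof(utf32) / sizeof(char32_t) == 5, "4 code units + NUL");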


The guarantee made for wchar_t was that any character supported in a locale could be converted from its char representation to wchar_t, and whatever representation char used, be it multiple bytes, shift codes, what have you, the wchar_t would be a single, distinct value. The purpose was that you could then manipulate wchar_t strings with the same simple algorithms used for ASCII.
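A minimal sketch of that char-to-wchar_t conversion, using the C library's mbstowcs and assuming the environment's locale can represent the text:

    #include <clocale>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        std::setlocale(LC_ALL, "");        // use the environment's locale
        const char *mb = "hello";          // multibyte: maybe >1 byte per character
        wchar_t wide[32];
        // After conversion, each element of wide is one whole character.
        std::size_t n = std::mbstowcs(wide, mb, 32);
        std::printf("%zu wide characters\n", n);
    }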

For example, uppercasing an ASCII string goes like this:

    #include <locale>

    int main() {
        auto loc = std::locale("");

        char s[] = "hello";
        for (char &c : s) {
            c = std::toupper(c, loc);
        }
    }

But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:

    #include <locale>

    int main() {
        auto loc = std::locale("");

        wchar_t s[] = L"hello";
        for (wchar_t &c : s) {
            c = std::toupper(c, loc);
        }
    }

So every wchar_t is a 'character', and if it has an uppercase version then it can be converted directly. Unfortunately this doesn't really work all the time; for example, some languages have oddities, such as the German letter ß, whose uppercase form is the two characters "SS" rather than a single character.
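You can watch the one-to-one mapping come up short for ß; the exact result is locale- and implementation-dependent, and returning the input unchanged is common:

    #include <clocale>
    #include <cstdio>
    #include <cwctype>

    int main() {
        std::setlocale(LC_ALL, "");
        // Correct uppercasing of ß (U+00DF) is the two-character string "SS",
        // but towupper can only return one value, so implementations either
        // return ß unchanged or map it to the capital ẞ (U+1E9E).
        wint_t up = std::towupper(L'\u00df');
        std::printf("U+00DF -> U+%04X\n", (unsigned)up);
    }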

So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.

The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
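As a sketch of that boundary conversion in C++11 terms (the helper names are mine; std::wstring_convert and <codecvt> are C++11, though later deprecated in C++17, and on Windows, where wchar_t is 16-bit, codecvt_utf8_utf16<wchar_t> would be the better facet):

    #include <codecvt>
    #include <locale>
    #include <string>

    // Keep UTF-8 in std::string internally; widen only at the API boundary.
    std::wstring to_wide(const std::string &utf8) {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
        return conv.from_bytes(utf8);
    }

    std::string to_utf8(const std::wstring &wide) {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
        return conv.to_bytes(wide);
    }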

answered Nov 09 '22 by bames53


The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode code points extend beyond 16 bits (up to U+10FFFF), it was felt that we should have better character types.

The intent is for char16_t to represent UTF-16 and for char32_t to directly represent Unicode code points. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and it is more work than just instantiating the stream templates for these types.
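For instance, there is no std::cout counterpart for char32_t, so one workaround (again via C++11's std::wstring_convert, later deprecated) is to convert to UTF-8 and write to an ordinary char stream:

    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        std::u32string s = U"caf\u00e9";        // no stream prints char32_t directly
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
        std::cout << conv.to_bytes(s) << '\n';  // downconvert to UTF-8 first
    }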

answered Nov 09 '22 by Dietmar Kühl