char vs wchar_t vs char16_t vs char32_t (c++11)

From what I understand, a char is safe to house ASCII characters whereas char16_t and char32_t are safe to house characters from unicode, one for the 16-bit variety and another for the 32-bit variety (Should I have said "a" instead of "the"?). But I'm then left wondering what the purpose behind the wchar_t is. Should I ever use that type in new code, or is it simply there to support old code? What was the purpose of wchar_t in old code if, from what I understand, its size had no guarantee to be bigger than a char? Clarification would be nice!

999

asked Sep 28 '13 15:09

user904963

2 Answers

char is for 8-bit code units, char16_t is for 16-bit code units, and char32_t is for 32-bit code units. Any of these can be used for 'Unicode'; UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
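To make the code-unit distinction concrete, here is a small sketch that counts how many code units the same character needs in each encoding. The helper name `code_units` is made up for this illustration; the counts themselves follow from the definitions of UTF-8, UTF-16, and UTF-32.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+0041 'A' is a single code unit in every encoding.
static_assert(code_units(u8"A") == 1, "UTF-8: one 8-bit unit");
static_assert(code_units(u"A") == 1, "UTF-16: one 16-bit unit");
static_assert(code_units(U"A") == 1, "UTF-32: one 32-bit unit");

// U+20AC (euro sign) needs three UTF-8 units but one UTF-16 or UTF-32 unit.
static_assert(code_units(u8"\u20AC") == 3, "UTF-8: three 8-bit units");
static_assert(code_units(u"\u20AC") == 1, "UTF-16: one 16-bit unit");
static_assert(code_units(U"\u20AC") == 1, "UTF-32: one 32-bit unit");

// U+1F600 lies outside the 16-bit range, so UTF-16 needs a surrogate pair.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four 8-bit units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: two 16-bit units");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one 32-bit unit");
```

So none of the three types means "one character per element" in general; only char32_t holds every code point in a single unit.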

The guarantee made for wchar_t was that any character supported in a locale could be converted from char to wchar_t, and that, whatever representation was used for char (multiple bytes, shift codes, what have you), the wchar_t would be a single, distinct value. The purpose was that you could then manipulate wchar_t strings with the same simple algorithms used for ASCII.

For example, converting ASCII to upper case goes like:

    auto loc = std::locale("");

    char s[] = "hello";
    for (char &c : s) {
      c = toupper(c, loc);
    }

But this won't handle converting all characters in UTF-8 to uppercase, or all characters in some other encoding like Shift-JIS. People wanted to be able to internationalize this code like so:

    auto loc = std::locale("");

    wchar_t s[] = L"hello";
    for (wchar_t &c : s) {
      c = toupper(c, loc);
    }

So every wchar_t is a 'character', and if it has an uppercase version then it can be directly converted. Unfortunately this doesn't really work all the time. For example, there exist oddities in some languages, such as the German letter ß, where the uppercase version is actually the two characters SS instead of a single character.
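A short sketch of why the ß case cannot work with this approach: a loop that maps one wchar_t to one wchar_t can never change the length of the string. The function name `upper_per_char` is hypothetical, and defaulting to the classic "C" locale is an assumption made only to keep the behaviour deterministic; what happens to non-ASCII characters under other locales depends on the platform.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-cases a wide string one wchar_t at a time,
// exactly like the loop above. Each element is replaced by a single
// wchar_t, so the result always has the same length as the input.
std::wstring upper_per_char(std::wstring s,
                            const std::locale& loc = std::locale::classic()) {
    for (wchar_t& c : s) {
        c = std::toupper(c, loc);
    }
    return s;
}
```

Calling `upper_per_char(L"stra\u00DFe")` returns a 6-element string, whereas the correct uppercase form "STRASSE" has 7 letters. No choice of locale fixes that, because the interface itself is one-in, one-out.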

So internationalized text handling is intrinsically harder than ASCII and cannot really be simplified in the way the designers of wchar_t intended. As such, wchar_t and wide characters in general provide little value.

The only reason to use them is that they've been baked into some APIs and platforms. However, I prefer to stick to UTF-8 in my own code even when developing on such platforms, and to just convert at the API boundaries to whatever encoding is required.
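As a rough illustration of converting at the boundary, here is a hand-written UTF-8 to UTF-16 conversion. The function name `utf8_to_utf16` is made up for this sketch, and it is deliberately minimal: it rejects truncated or malformed sequences but does not check for overlong encodings or encoded surrogates. In real code you would more likely use whatever conversion routine the platform or a Unicode library provides.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 text to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;      // code point being decoded
        std::size_t len;  // number of bytes in this UTF-8 sequence
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > in.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            // Fits in a single 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair: high unit first, then low unit.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The idea is that the rest of the program keeps its text in std::string as UTF-8, and a helper like this runs only at the point where a 16-bit API is called. On a platform where wchar_t is 16 bits, the resulting code units could be copied into a std::wstring before the call.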

135

answered Nov 09 '22 06:11

bames53

The type wchar_t was put into the standard when Unicode promised to create a 16-bit representation. Most vendors chose to make wchar_t 32 bits, but one large vendor chose to make it 16 bits. Since Unicode now needs more than 16 bits (code points go up to U+10FFFF, which takes 21 bits), it was felt that we should have better character types.

The intent for char16_t is to represent UTF-16, and char32_t is meant to directly represent Unicode characters. However, on systems using wchar_t as part of their fundamental interface, you'll be stuck with wchar_t. If you are unconstrained, I would personally use char to represent Unicode using UTF-8. The problem with char16_t and char32_t is that they are not fully supported, not even in the standard C++ library: for example, there are no streams supporting these types directly, and it is more work than just instantiating the stream for these types.

answered Nov 09 '22 07:11

Dietmar Kühl
