Is the <code>wchar_t</code> type required for Unicode support? If not, then what's the point of this multibyte type? Why would you use <code>wchar_t</code> when you could accomplish the same thing with <code>char</code>?

Is wchar_t needed for unicode support?

2 Answers

No.

Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.

So you could use Unicode with the UTF-8 encoding, and then every character would fit in one char or a short sequence of char objects, and strings would even still be null-terminated.
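To make the "short sequence of char objects" concrete, here is a minimal sketch of UTF-8 encoding into an ordinary std::string. The name encode_utf8 is a hypothetical helper chosen for illustration, not a standard library function, and the sketch assumes the caller passes a valid code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (not a surrogate, at most U+10FFFF).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every byte of a multi-byte sequence has its high bit set, so the only zero byte a UTF-8 string can contain is an encoded U+0000. That is why null termination keeps working.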

The problem with UTF-8 and UTF-16 is that s[i] is not necessarily a character any more; it might be just a piece of one. With sufficiently wide characters you can preserve the abstraction that s[i] is a single character, though that does not keep string lengths fixed under various transformations.
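A small sketch of why s[i] stops being a character in UTF-8: the number of bytes and the number of code points diverge as soon as the text leaves ASCII. The helper name count_code_points is hypothetical, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}
```

For the string "h\xC3\xA9llo" ("héllo" in UTF-8), size() reports 6 bytes while the function above reports 5 code points, and s[2] is the second half of "é" rather than a character.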

32-bit integers are at least wide enough to solve the code point problem, but they still don't handle corner cases; for example, uppercasing a string can change the number of characters.
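The uppercasing corner case can be sketched even with 32-bit code units. The function toy_upper below is a deliberately tiny, hypothetical mapping that knows only ASCII letters and U+00DF (ß, whose full uppercase mapping in Unicode is "SS"); real case conversion needs the complete Unicode tables.

```cpp
#include <string>

// Toy uppercasing over UTF-32. Handles only a-z and U+00DF (sharp s).
// This is an illustration, not a substitute for a real Unicode library.
std::u32string toy_upper(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x00DF) {
            out += U"SS";  // one code point becomes two
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - U'a' + U'A');
        } else {
            out += c;
        }
    }
    return out;
}
```

Uppercasing the six code points of "straße" yields the seven code points of "STRASSE", so indices into the original string no longer line up with the result even though every element is a full code point.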

So it turns out that the s[i] problem is not completely solved even by char32_t, and the wide encodings make poor file formats.

Your implied point, then, is quite valid: wchar_t is a failure, partly because Windows made it only 16 bits, and partly because it didn't solve every problem and was horribly incompatible with the byte stream abstraction.

140

answered Sep 29 '22 07:09

DigitalRoss

As has already been noted, wchar_t is absolutely not necessary for Unicode support. Not only that, it is also utterly useless for that purpose, since the standard provides no fixed-size guarantee for wchar_t (in other words, you don't know ahead of time what sizeof( wchar_t ) will be on a particular system), whereas sizeof( char ) will always be 1.
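The size difference is easy to check at compile time. In this sketch, wchar_bits is a hypothetical helper name; the values mentioned in the comments are what common toolchains typically use, not something the standard promises.

```cpp
#include <climits>
#include <cstddef>

// sizeof(char) is 1 by definition, on every conforming implementation.
static_assert(sizeof(char) == 1, "guaranteed by the language");

// Width of wchar_t in bits on the current platform. The standard does not
// fix this: it is typically 16 with MSVC on Windows and 32 with GCC or
// Clang on Linux.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Code that assumes a particular result from wchar_bits() is therefore tied to one platform.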

In a UTF-8 encoding, any Unicode code point is mapped to a sequence of one to four octets. In a UTF-16 encoding, any Unicode code point is mapped to a sequence of one or two 16-bit code units. In a UTF-32 encoding, any Unicode code point is mapped to exactly one 32-bit code unit.
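The three mappings can be summarized as a small function that reports how many code units a code point needs in each encoding. UnitCounts and code_units are hypothetical names used only for this sketch.

```cpp
#include <cassert>

// Number of code units needed for one code point in each encoding form.
struct UnitCounts {
    int utf8;   // 8-bit units (octets)
    int utf16;  // 16-bit units
    int utf32;  // 32-bit units
};

// Hypothetical helper; assumes cp is at most U+10FFFF.
UnitCounts code_units(char32_t cp) {
    assert(cp <= 0x10FFFF);
    int u8 = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    int u16 = cp < 0x10000 ? 1 : 2;  // above the BMP needs a surrogate pair
    return {u8, u16, 1};
}
```

For example, U+0041 ("A") needs 1, 1 and 1 units, U+20AC ("€") needs 3, 1 and 1, and U+1F600 needs 4, 2 and 1.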

As you can see, wchar_t could be of some use for implementing UTF-16 support if the standard were nice enough to guarantee that wchar_t is always 16 bits wide. Unfortunately it does not, so you'd have to resort to a fixed-width integer type from <cstdint> (such as std::uint16_t) anyway.
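As a rough sketch of that approach, UTF-16 code units can be stored in std::uint16_t regardless of how wide wchar_t is. The function encode_utf16 is a hypothetical helper and assumes a valid code point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units held in
// a fixed-width integer type rather than wchar_t.
std::vector<std::uint16_t> encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return {static_cast<std::uint16_t>(cp)};  // single unit in the BMP
    }
    // Above the BMP: split the 20-bit offset into a surrogate pair.
    char32_t v = cp - 0x10000;
    return {static_cast<std::uint16_t>(0xD800 + (v >> 10)),
            static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))};
}
```

With this helper, U+20AC comes back as the single unit 0x20AC, while U+1F600 comes back as the pair 0xD83D, 0xDE00.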

<slightly OffTopic Microsoft-specific rant>

What's more infuriating is the additional confusion caused by the UNICODE and MBCS (multi-byte character set) build configurations in Microsoft's Visual Studio. Both of these are A) confusing and B) an outright lie, because a "UNICODE" configuration in Visual Studio does nothing to buy the programmer actual Unicode support, and the difference implied by these two build configurations makes no sense.

To explain, Microsoft recommends using TCHAR instead of using char or wchar_t directly. In an MBCS configuration, TCHAR expands to char, meaning you could potentially use it to implement UTF-8 support. In a UNICODE configuration, it expands to wchar_t, which in Visual Studio happens to be 16 bits wide and could potentially be used to implement UTF-16 support (which, as far as I'm aware, is the native encoding used by Windows). However, both of these encodings are multi-byte character sets, since both UTF-8 and UTF-16 allow a particular Unicode character to be encoded as more than one char or wchar_t respectively, so the term multi-byte character set (as opposed to single-byte character set?) makes little sense.
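The switch described above amounts to little more than a conditional typedef. The following portable sketch imitates the mechanism with made-up names: MY_UNICODE_BUILD, my_tchar and my_tchar_size are hypothetical and are not the real Windows macros or types.

```cpp
#include <cstddef>

// Hypothetical stand-in for the TCHAR mechanism. The build configuration
// defines (or does not define) a macro, and the character type follows.
#ifdef MY_UNICODE_BUILD
typedef wchar_t my_tchar;  // "UNICODE"-style build: wide code units
#else
typedef char my_tchar;     // "MBCS"-style build: byte-sized code units
#endif

// Size in bytes of the selected code unit type.
constexpr std::size_t my_tchar_size() {
    return sizeof(my_tchar);
}
```

Either way the typedef only changes the width of a code unit. It adds no knowledge of code points, normalization or case mapping.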

To add insult to injury, merely using the Unicode configuration does not actually give you one iota of Unicode support. To actually get that, you have to use an actual Unicode library like ICU ( http://site.icu-project.org/ ). In short, the wchar_t type and Microsoft's MBCS and UNICODE configurations add nothing of any use and cause unnecessary confusion, and the world would be a significantly better place if none of them had ever been invented.

</slightly OffTopic Microsoft-specific rant>

answered Sep 29 '22 06:09

antred


Tags: c++, c, unicode

zer0stimulus
