According to cppreference.com's documentation on wchar_t:

wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode; a notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units). It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.
The Standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
So, if I want to deal with Unicode characters, should I use wchar_t?

Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?
The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.
wchar_t is used when you need to store characters whose code points do not fit in a plain char (i.e. values above 255), because those characters need more storage than an 8-bit character type provides. It generally has a size greater than 8 bits. The Windows operating system uses it substantially.
And wchar_t is UTF-16 on Windows, so on Windows a conversion between wchar_t and UTF-16 data can just be a memcpy. On everything else, the conversion is algorithmic, and pretty simple.
So, if I want to deal with Unicode characters, should I use wchar_t?

First of all, note that the encoding does not force you to use any particular type to represent a certain character. You may use char to represent Unicode characters just as wchar_t can - you only have to remember that in UTF-8 up to 4 chars together form a single valid code point, while wchar_t needs either 1 unit (UTF-32 on Linux, etc.) or up to 2 units working together (UTF-16 on Windows).
Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width for representing code points (like UTF-32), while others (such as UTF-8 and UTF-16) have variable lengths (the letter 'a' will use just 1 byte in UTF-8, but characters outside the English alphabet will use more bytes for their representation).
So you have to decide what kind of characters you want to represent and then choose your encoding accordingly. The kind of characters you represent affects the number of bytes your data will take. E.g. using UTF-32 to represent mostly English text leads to many zero bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages.
Once you have decided on this, you should minimize the amount of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you would like to do text manipulation/interpretation on a code-point basis, char certainly is not the way to go if you have e.g. Japanese kanji. But if you just want to communicate your data and regard it as no more than a quantitative sequence of bytes, you may just go with char.
The link to UTF-8 Everywhere was already posted as a comment, and I suggest you have a look there as well. Another good read is What Every Programmer Should Know About Encodings.
As of now, there is only rudimentary language support in C++ for Unicode (like the char16_t and char32_t data types, and the u8/u/U literal prefixes). So choosing a library for managing encodings (especially conversions) certainly is good advice.
wchar_t is used on Windows, which uses the UTF-16LE format. wchar_t requires wide-character functions: for example, wcslen(const wchar_t*) instead of strlen(const char*), and std::wstring instead of std::string.
Unix-based machines (Linux, Mac, etc.) use UTF-8. This uses char for storage, and the same C and C++ functions as for ASCII, such as strlen(const char*) and std::string (see comments below about std::find_first_of).
wchar_t is 2 bytes (UTF-16) on Windows. But on other machines it is 4 bytes (UTF-32). This makes things more confusing.
For UTF-32, you can use std::u32string, which is the same on different systems.
You might consider converting UTF-8 to UTF-32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.
UTF-8 is designed so that ASCII bytes (0 to 127) are never used to represent part of another Unicode code point. That includes the escape character '\', printf format specifiers, and common parsing characters like ','.
Consider the following UTF-8 string. Let's say you want to find the comma:

std::string str = u8"汉,🙂"; // 3 code points represented by 8 bytes
The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','.

To find 汉, you can search for the string u8"汉", since this code point cannot be represented as a single character.
Some C and C++ functions don't work smoothly with UTF-8. These include:

strtok
strspn
std::find_first_of

The argument for the above functions is a set of characters, not an actual string.
So str.find_first_of(u8"汉") does not work, because u8"汉" is 3 bytes, and find_first_of will look for any of those bytes. There is a chance that one of those bytes is used to represent a different code point.
On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character).
In rare cases UTF-32 might be required (although I can't imagine where!). You can use std::codecvt to convert UTF-8 to UTF-32 to run the following operations:
std::u32string u32 = U"012汉"; // 4 code points, represented by 4 elements
cout << u32.find_first_of(U"汉") << endl; // outputs 3
cout << u32.find_first_of(U'汉') << endl; // outputs 3
Side note:
You should use "Unicode everywhere", not "UTF-8 everywhere".
On Linux, Mac, etc., use UTF-8 for Unicode.
On Windows, use UTF-16 for Unicode. Windows programmers use UTF-16; they don't make pointless conversions back and forth to UTF-8. But there are legitimate cases for using UTF-8 on Windows.
Windows programmers tend to use UTF-8 for saving files, web pages, etc., so that's less to worry about for non-Windows programmers in terms of compatibility.
The language itself doesn't care which Unicode format you want to use, but in terms of practicality, use a format that matches the system you are working on.