According to cppreference.com's documentation on wchar_t:

wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode; a notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units). It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.
The Standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
So, if I want to deal with Unicode characters, should I use wchar_t?

Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?
The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.
wchar_t is used when you need to store characters whose code points do not fit in a plain char (i.e. values above 255), because those characters need more storage than an 8-bit character type provides. It generally has a size greater than 8 bits. The Windows operating system uses it substantially.
And wchar_t is UTF-16 on Windows, so on Windows a conversion between wchar_t and UTF-16 data can just be a memcpy. On everything else, the conversion is algorithmic, and pretty simple.
So, if I want to deal with Unicode characters, should I use wchar_t?

First of all, note that the encoding does not force you to use any particular type to represent a certain character. You may use char to represent Unicode characters just as wchar_t can - you only have to remember that in UTF-8 up to 4 chars together form a single valid code point, while wchar_t needs either 1 unit (UTF-32 on Linux, etc.) or up to 2 units working together (UTF-16 on Windows).
Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width for representing code points (like UTF-32), while others (such as UTF-8 and UTF-16) have variable lengths (the letter 'a' will use just 1 byte in UTF-8, but characters outside the English alphabet will use more bytes for their representation).
So you have to decide what kind of characters you want to represent and then choose your encoding accordingly. The kind of characters you represent affects the number of bytes your data will take. E.g. using UTF-32 to represent mostly English text leads to many zero bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages.
Once you have decided on this, you should minimize the amount of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you would like to do text manipulation/interpretation on a code-point basis, char certainly is not the way to go if you have e.g. Japanese kanji. But if you just want to communicate your data and regard it as no more than a quantitative sequence of bytes, you may just go with char.
The link to UTF-8 Everywhere was already posted as a comment, and I suggest you have a look there as well. Another good read is What Every Programmer Should Know About Encodings.
As of now, there is only rudimentary language support in C++ for Unicode (like the char16_t and char32_t data types, and the u8/u/U literal prefixes). So choosing a library for managing encodings (especially conversions) certainly is good advice.
wchar_t is used on Windows, which uses the UTF-16LE format. wchar_t requires wide-character functions: for example, wcslen(const wchar_t*) instead of strlen(const char*), and std::wstring instead of std::string.
Unix-based machines (Linux, Mac, etc.) use UTF-8. This uses char for storage, and the same C and C++ functions as for ASCII, such as strlen(const char*) and std::string (see comments below about std::find_first_of).
wchar_t is 2 bytes (UTF-16) on Windows. But on other machines it is 4 bytes (UTF-32). This makes things more confusing.
For UTF-32, you can use std::u32string, which is the same on different systems.
You might consider converting UTF-8 to UTF-32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.
UTF-8 is designed so that ASCII bytes (0 to 127) are never used to represent part of another Unicode code point. That includes the escape character '\', printf format specifiers, and common parsing characters like ','.
Consider the following UTF-8 string. Let's say you want to find the comma:

std::string str = u8"汉,🙂"; // 3 code points represented by 8 bytes
The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','.

To find 汉, you can search for the string u8"汉", since this code point cannot be represented as a single character.
Some C and C++ functions don't work smoothly with UTF-8. These include:

strtok
strspn
std::find_first_of

The argument for the above functions is a set of characters, not an actual string.
So str.find_first_of(u8"汉") does not work, because u8"汉" is 3 bytes, and find_first_of will look for any of those bytes. There is a chance that one of those bytes is used to represent a different code point.
On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character).
In rare cases UTF-32 might be required (although I can't imagine where!). You can use std::codecvt to convert UTF-8 to UTF-32 to run the following operations:
std::u32string u32 = U"012汉"; // 4 code points, represented by 4 elements
cout << u32.find_first_of(U"汉") << endl; // outputs 3
cout << u32.find_first_of(U'汉') << endl; // outputs 3
Side note:
You should use "Unicode everywhere", not "UTF-8 everywhere".
On Linux, Mac, etc., use UTF-8 for Unicode.
On Windows, use UTF-16 for Unicode. Windows programmers use UTF-16; they don't make pointless conversions back and forth to UTF-8. But there are legitimate cases for using UTF-8 on Windows.
Windows programmers tend to use UTF-8 for saving files, web pages, etc., so that's less to worry about for non-Windows programmers in terms of compatibility.
The language itself doesn't care which Unicode format you want to use, but in terms of practicality, use a format that matches the system you are working on.