I want to understand the difference between char and wchar_t. I understand that wchar_t uses more bytes, but can I get a clear-cut example of when I would use char vs. wchar_t?
The type unsigned char is often used to represent a byte, which isn't a built-in type in C++. The wchar_t type is an implementation-defined wide character type.
Just as the type for character constants is char, the type for wide characters is wchar_t. This data type occupies 2 or 4 bytes, depending on the compiler being used. The wchar_t data type is mostly used when handling international languages such as Japanese.
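To make the size difference concrete, here is a minimal sketch that reports how wide wchar_t is on the compiler at hand. The helper names (wide_char_bits, narrow_storage_bytes, wide_storage_bytes, size_demo) are made up for illustration; the typical values in the comments are common cases, not guarantees.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Bits in one wchar_t on this platform: commonly 16 with MSVC on Windows
// and 32 with GCC/Clang on Linux and macOS (implementation-defined).
constexpr std::size_t wide_char_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Bytes used by the characters of a narrow string (terminator excluded).
std::size_t narrow_storage_bytes(const std::string& text) {
    return text.size() * sizeof(char);
}

// Bytes used by the characters of a wide string (terminator excluded).
std::size_t wide_storage_bytes(const std::wstring& text) {
    return text.size() * sizeof(wchar_t);
}

// Hypothetical usage: the same three letters cost more storage as wchar_t.
void size_demo() {
    assert(sizeof(char) == 1);  // guaranteed by the standard
    assert(narrow_storage_bytes("abc") == 3);
    assert(wide_storage_bytes(L"abc") == 3 * sizeof(wchar_t));
}
```

Because the width is implementation-defined, code that serializes wchar_t data and reads it back on another platform should not assume a fixed size.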
wchar_t is used when you need to store characters with codes greater than 255, which is more than an 8-bit char can hold.
And wchar_t is UTF-16 on Windows, so on Windows the conversion function can just do a memcpy :-) On everything else, the conversion is algorithmic and pretty simple.
Some languages have more than 256 possible encodings. A char type does not guarantee a range greater than 256. Thus a new data type is required. The wchar_t, a.k.a. wide characters, provides more room for encodings. Use char data type when the range of encodings is 256 or less, such as ASCII.
Under the MSVC /J compiler option, variables of type char are treated as unsigned char and are promoted to int without sign extension.
The char type was the original character type in C and C++. The char type can be used to store characters from the ASCII character set or any of the ISO-8859 character sets, and individual bytes of multi-byte characters such as Shift-JIS or the UTF-8 encoding of the Unicode character set. In the Microsoft compiler, char is an 8-bit type.
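As a small illustration of char holding the individual bytes of a multi-byte encoding, the sketch below counts code points in a UTF-8 string stored in a std::string. The function names are hypothetical, and the code assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte
        }
    }
    return count;
}

// Hypothetical usage: U+00E9 (e with acute accent) is two bytes in UTF-8.
// Hex escapes keep the example independent of the source file encoding.
void utf8_demo() {
    const std::string e_acute = "\xC3\xA9";
    assert(e_acute.size() == 2);             // two char elements
    assert(utf8_code_points(e_acute) == 1);  // one character
}
```

The point is that size() on a char string counts bytes, not characters, once the content goes beyond ASCII.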
Short answer:
You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically, use wchar_t only to call Windows API functions).
Long answer:
The design of the standard C++ library implies there is only one way to handle Unicode: storing UTF-8 encoded strings in char arrays, since almost all functions exist only in char variants (think of std::exception::what).
In a C++ program you have two locales:
The standard C library locale, set by std::setlocale.
The standard C++ library locale, set by std::locale::global.
Unfortunately, neither of them defines the behavior of standard functions that open files (such as std::fopen, std::fstream::open, etc.). Behavior differs between OSes:
Everything usually works fine on Linux, because everyone uses UTF-8 based locales, so all user input and the arguments passed to main will be UTF-8 encoded. You might still need to switch the current locales to UTF-8 variants explicitly, because a C++ program starts in the default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, assume they hold UTF-8 sequences, and everything "just works".
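Switching away from the default "C" locale could look roughly like the sketch below. The function names are invented for this example; whether the environment actually names a UTF-8 locale depends on the user's settings (for example LANG), which the code does not check.

```cpp
#include <cassert>
#include <clocale>
#include <locale>
#include <stdexcept>
#include <string>

// Name of the current C library locale; a freshly started program reports "C".
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Switches both the C and the C++ global locale to the one named by the
// environment. Returns false if that locale is not installed.
bool adopt_environment_locale() {
    if (std::setlocale(LC_ALL, "") == nullptr) {
        return false;
    }
    try {
        std::locale::global(std::locale(""));
    } catch (const std::runtime_error&) {
        return false;  // the C++ library rejected the environment locale
    }
    return true;
}

// Hypothetical usage: whether or not the switch succeeds, some C locale
// is always active afterwards.
void locale_demo() {
    const bool switched = adopt_environment_locale();
    (void)switched;
    assert(!current_c_locale().empty());
}
```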
Problems appear when you want to support Windows, because there you always have an additional, third locale: the one set for the current user, which can be configured somewhere in "Control Panel". The main issue is that this locale is never a Unicode locale, so it is impossible to use functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you will have to use custom wrappers around non-standard, Windows-specific functions like _wfopen and std::fstream::open(const wchar_t *). You can check Boost.Nowide (part of Boost since version 1.73) to see how this can be done: http://cppcms.com/files/nowide/html/
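A custom wrapper of the kind described above might be sketched as follows. The name fopen_utf8 is hypothetical and this is not how Boost.Nowide is implemented; it only assumes that the caller passes UTF-8 and that the Win32 function MultiByteToWideChar is used for the conversion on Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: opens a file whose path is UTF-8 encoded on every OS.
std::FILE* fopen_utf8(const std::string& path, const char* mode) {
    assert(mode != nullptr);
#ifdef _WIN32
    // Convert UTF-8 to the UTF-16 wchar_t strings that _wfopen expects.
    auto widen = [](const std::string& s) -> std::wstring {
        if (s.empty()) {
            return std::wstring();
        }
        const int len = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                            static_cast<int>(s.size()),
                                            nullptr, 0);
        std::wstring wide(static_cast<std::size_t>(len), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(),
                            static_cast<int>(s.size()), &wide[0], len);
        return wide;
    };
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    // On Linux and similar systems paths are byte strings, so the UTF-8
    // bytes can be handed to std::fopen unchanged.
    return std::fopen(path.c_str(), mode);
#endif
}
```

With a wrapper like this, the rest of the program can keep every path in a UTF-8 std::string, and wchar_t stays confined to the one place that talks to the Windows API.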
With C++17 you can use std::filesystem::path to store a file path in a portable way, but it is still broken on Windows:
The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC, and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string, but it is too easy to forget this and use the implicit constructor instead.
std::error_category::message(int) for both error categories returns the error description in the user-specific encoding.
So what we have on Windows is:
Arguments passed to main(int, char**) are broken and should never be used.
std::filesystem::path is partially broken and should never be used directly.
The error categories returned by std::generic_category and std::system_category are broken and should never be used.
If you need a long-term solution for a non-trivial project, I would recommend:
Re-implementing the error categories returned by std::generic_category and std::system_category so that they always return UTF-8 encoded strings.
Wrapping std::filesystem::path so that the new class always uses UTF-8 when converting a path to a string and a string to a path.
Wrapping the functions you need from std::filesystem so that they use your path wrapper and your error categories.
Unfortunately, this won't fix issues with other libraries that work with files, but many are broken anyway (they do not support Unicode).
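The idea of always going through UTF-8 when converting between strings and paths could be sketched with two small helpers. Both function names are made up for this example, and C++17 or later is assumed. Since std::filesystem::u8path is deprecated in C++20, the sketch switches to the char8_t-based constructor when that feature is available.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: builds a path from a UTF-8 encoded std::string
// without consulting the user-specific locale.
std::filesystem::path path_from_utf8(const std::string& utf8) {
#if defined(__cpp_char8_t)
    // C++20: u8path is deprecated, so construct from a std::u8string.
    return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path interprets its argument as UTF-8.
    return std::filesystem::u8path(utf8);
#endif
}

// Hypothetical helper: converts a path back to a UTF-8 std::string.
std::string path_to_utf8(const std::filesystem::path& p) {
    // u8string() returns std::string in C++17 and std::u8string in C++20.
    const auto encoded = p.u8string();
    return std::string(encoded.begin(), encoded.end());
}

// Hypothetical usage: a non-ASCII file name survives the round trip.
void path_demo() {
    const std::string name = "caf\xC3\xA9.txt";  // UTF-8 bytes for U+00E9
    assert(path_to_utf8(path_from_utf8(name)) == name);
}
```

A path wrapper class as recommended above would expose only conversions like these, so the locale-dependent implicit constructor cannot be reached by accident.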
You can check this link for further explanation: http://utf8everywhere.org/
Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.
Background
The char type has enough capacity to hold any character (encoding) in the ASCII character set.
The issue is that many languages require more encodings than ASCII accounts for. So, instead of 128 possible encodings, more are needed. Some languages have more than 256 possible encodings. A char type is not guaranteed to hold more than 256 distinct values. Thus a new data type is required.
The wchar_t type, a.k.a. wide characters, provides more room for encodings.
Summary
Use the char data type when the range of encodings is 256 or less, such as ASCII. Use wchar_t when you need the capacity for more than 256.
Prefer Unicode to handle large character sets (such as emojis).
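A rough way to see the capacity rule from the summary in code is to check whether a given character code fits in a single element of each type. The helper names are hypothetical, and the char check assumes the usual 8-bit char.

```cpp
#include <cassert>
#include <limits>

// True when the code fits in one 8-bit char element (0..255).
bool fits_in_char(unsigned long code_point) {
    return code_point <= std::numeric_limits<unsigned char>::max();
}

// True when the code fits in one wchar_t element on this platform.
bool fits_in_wchar(unsigned long code_point) {
    return code_point <=
           static_cast<unsigned long>(std::numeric_limits<wchar_t>::max());
}

// Hypothetical usage with two well-known code points.
void range_demo() {
    assert(fits_in_char(0x41));     // 'A' is plain ASCII
    assert(!fits_in_char(0x20AC));  // the euro sign, U+20AC, is too large
    assert(fits_in_wchar(0x20AC));  // but it fits in one wchar_t
}
```

An emoji such as U+1F600 fits in a single wchar_t only where wchar_t is 32 bits wide. With a 16-bit wchar_t, as on Windows, it needs a UTF-16 surrogate pair, which is one reason to think in terms of Unicode encodings rather than element sizes.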