I want to understand the difference between char and wchar_t. I understand that wchar_t uses more bytes, but can I get a clear-cut example showing when I would use char vs wchar_t?


char vs wchar_t when to use which data type

2 Answers

Short answer:

You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).

Long answer:

The design of the C++ standard library implies there is only one way to handle Unicode: storing UTF-8 encoded strings in char arrays, since almost all functions exist only in char variants (think of std::exception::what).

In a C++ program you have two locales:

Standard C library locale set by std::setlocale
Standard C++ library locale set by std::locale::global
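To make the two locales concrete, here is a minimal sketch that queries each one and, optionally, switches both to the locale named by the environment. The helper names (current_c_locale, current_cpp_locale, use_environment_locale) are made up for this illustration and are not part of any library.

```cpp
#include <clocale>
#include <locale>
#include <stdexcept>
#include <string>

// Name of the current C library locale. Passing nullptr only queries it.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Name of the current global C++ locale (a default-constructed std::locale
// is a copy of the global one).
std::string current_cpp_locale() {
    return std::locale().name();
}

// Tries to switch to the locale named by the environment (for example LANG).
// std::locale::global also calls std::setlocale when the locale has a name,
// so both locales change together. Returns false if the environment names
// a locale that is not installed.
bool use_environment_locale() {
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Before any switch, both query functions report the "C" locale, which is the default every program starts with.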

Unfortunately, neither of them defines the behavior of the standard functions that open files (such as std::fopen and std::fstream::open). Behavior differs between OSes:

Linux is encoding-agnostic, so those functions simply pass the char string to the underlying system call.
On Windows, the char string is converted to a wide string using the user-specific locale before the system call is made.

Everything usually works fine on Linux because nearly everyone uses UTF-8 based locales, so all user input and the arguments passed to main will be UTF-8 encoded. You might still need to switch the current locales to UTF-8 variants explicitly, because a C++ program starts in the default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, treat their contents as UTF-8 sequences, and everything "just works".
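As a small illustration of treating std::string as a UTF-8 byte sequence, the sketch below counts code points rather than bytes. The function name utf8_code_points is hypothetical, and the code assumes its input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does not match that pattern starts a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Note that s.size() still reports bytes, so a string holding one accented letter can have size 2 while containing a single code point.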

Problems appear when you want to support Windows, because there you always have an additional third locale: the one set for the current user, which can be configured in the Control Panel. The main issue is that this locale is typically not a Unicode locale, so you cannot rely on functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you will have to use custom wrappers around non-standard, Windows-specific functions such as _wfopen and std::fstream::open(const wchar_t *). You can check Boost.Nowide (originally a standalone library; part of Boost since version 1.73) to see how this can be done: http://cppcms.com/files/nowide/html/
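A wrapper of the kind described above might look roughly like this. It is only a sketch: fopen_utf8 and widen_utf8 are hypothetical names, not Boost.Nowide's API, and the Windows branch assumes the Win32 MultiByteToWideChar function and the Microsoft-specific _wfopen are available.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts UTF-8 to UTF-16 using the Win32 API.
// Returns an empty string for empty or invalid input.
static std::wstring widen_utf8(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8.data(), static_cast<int>(utf8.size()),
                                        nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}
#endif

// Opens a file whose path and mode are given as UTF-8 on every platform.
std::FILE* fopen_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    // Bypass the user-specific code page by calling the wide-character variant.
    return _wfopen(widen_utf8(path).c_str(), widen_utf8(mode).c_str());
#else
    // POSIX systems pass the bytes through unchanged.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

Calling code then uses fopen_utf8(path, "rb") everywhere and never touches wchar_t directly; the wide strings stay confined to the Windows branch.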

With C++17 you can use std::filesystem::path to store a file path in a portable way, but it is still broken on Windows:

The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC, and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string (path::u8string goes the other way and returns the path as UTF-8), but it is too easy to forget this and use the implicit constructor instead.
std::error_category::message(int) returns the error description in the user-specific encoding for both standard error categories.
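One way to avoid the implicit constructor is to route every conversion through a pair of small helpers. The sketch below uses std::filesystem::u8path, the C++17 free function that builds a path from UTF-8. Because C++20 deprecates u8path and changes path::u8string to return std::u8string, the code switches on the __cpp_lib_char8_t feature-test macro. The names path_from_utf8 and path_to_utf8 are hypothetical.

```cpp
#include <filesystem>
#include <string>

// Builds a path from a UTF-8 encoded std::string without going through
// the locale-dependent path(const char*) constructor.
std::filesystem::path path_from_utf8(const std::string& utf8) {
#if defined(__cpp_lib_char8_t)
    // C++20 and later: char8_t sources are always interpreted as UTF-8.
    return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path interprets its argument as UTF-8.
    return std::filesystem::u8path(utf8);
#endif
}

// Returns the path as a UTF-8 encoded std::string.
std::string path_to_utf8(const std::filesystem::path& p) {
#if defined(__cpp_lib_char8_t)
    const std::u8string u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
#else
    return p.u8string();
#endif
}
```

This needs a C++17 or newer compiler; some older toolchains also require linking an extra filesystem library.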

So what we have on Windows is:

Standard library functions that open files are broken and should never be used.
Arguments passed to main(int, char**) are broken and should never be used.
WinAPI functions whose names end in A, and the macros that can expand to them, are broken and should never be used.
std::filesystem::path is partially broken and should never be used directly.
Error categories returned by std::generic_category and std::system_category are broken and should never be used.
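For the command-line arguments in particular, a common workaround is to ignore argv on Windows and rebuild the arguments from the wide command line. The following is a sketch under that assumption: utf8_args and wide_to_utf8 are hypothetical names, the Windows branch relies on the Win32 functions GetCommandLineW, CommandLineToArgvW and WideCharToMultiByte, and it needs to be linked against Shell32.

```cpp
#include <cstddef>
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW

// Converts a NUL-terminated UTF-16 string to UTF-8.
static std::string wide_to_utf8(const wchar_t* wide) {
    // With a length of -1 the returned size includes the terminating NUL.
    const int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1,
                                        nullptr, 0, nullptr, nullptr);
    if (len <= 1) return std::string();
    std::string utf8(static_cast<std::size_t>(len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &utf8[0], len, nullptr, nullptr);
    utf8.resize(static_cast<std::size_t>(len) - 1);  // drop the NUL
    return utf8;
}
#endif

// Returns the program arguments as UTF-8 strings on every platform.
std::vector<std::string> utf8_args(int argc, char** argv) {
#ifdef _WIN32
    (void)argc;
    (void)argv;  // already converted to the user code page, so not used
    std::vector<std::string> args;
    int wide_argc = 0;
    wchar_t** wide_argv = CommandLineToArgvW(GetCommandLineW(), &wide_argc);
    if (wide_argv == nullptr) return args;
    for (int i = 0; i < wide_argc; ++i) {
        args.push_back(wide_to_utf8(wide_argv[i]));
    }
    LocalFree(wide_argv);
    return args;
#else
    // Elsewhere the bytes are passed through as received (normally UTF-8).
    return std::vector<std::string>(argv, argv + argc);
#endif
}
```

A program would call utf8_args(argc, argv) once at the top of main and use the returned vector from then on.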

If you need a long-term solution for a non-trivial project, I would recommend:

Using Boost.Nowide or implementing similar functionality directly. This works around the broken standard library functions.
Re-implementing the standard error categories returned by std::generic_category and std::system_category so that they always return UTF-8 encoded strings.
Wrapping std::filesystem::path so that the new class always uses UTF-8 when converting a path to a string and a string to a path.
Wrapping all required functions from std::filesystem so that they use your path wrapper and your error categories.

Unfortunately, this won't fix issues with other libraries that work with files, but many of them are broken anyway (they do not support Unicode).

You can check this link for further explanation: http://utf8everywhere.org/

answered Oct 16 '22 18:10

StaceyGirl

Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.

Background
The char type has enough capacity to hold any character (encoding) in the ASCII character set.

The issue is that many languages require more character codes than ASCII accounts for. So, instead of the 128 codes that ASCII defines, more are needed. Some languages have more than 256 characters. A char is only guaranteed to hold 256 distinct values. Thus a new data type is required.

The wchar_t type, a.k.a. the wide character type, provides more room for encodings.
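The difference in capacity can be seen with one character that does not fit in a single char. This is a small sketch, with U+20AC (the euro sign) chosen arbitrarily as the example; the names char_value_count, euro_utf8 and euro_wide are made up for the illustration.

```cpp
#include <climits>
#include <string>

// Number of distinct values one char can represent. CHAR_BIT is at least 8,
// so this is at least 256.
constexpr unsigned long char_value_count() {
    return 1UL << CHAR_BIT;
}

// U+20AC EURO SIGN stored two ways.
const std::string euro_utf8 = "\xE2\x82\xAC";  // UTF-8: three char code units
const std::wstring euro_wide = L"\u20AC";      // wide: a single wchar_t
```

Here euro_utf8.size() is 3 while euro_wide.size() is 1: the character needs several char units but fits in one wchar_t.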

Summary
Use char data type when the range of encodings is 256 or less, such as ASCII. Use wchar_t when you need the capacity for more than 256.

Prefer Unicode to handle large character sets (such as emojis).

answered Oct 16 '22 17:10

Thomas Matthews


Ankit Goel
