 

char vs wchar_t when to use which data type


I want to understand the difference between char and wchar_t. I understand that wchar_t uses more bytes, but can I get a clear-cut example that shows when I would use char and when I would use wchar_t?

Ankit Goel asked Aug 14 '17 15:08




2 Answers

Short answer:

You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).

Long answer:

The design of the standard C++ library implies there is only one way to handle Unicode: storing UTF-8 encoded strings in char arrays, since almost all functions exist only in char variants (think of std::exception::what).
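To illustrate what "UTF-8 in char arrays" means in practice, here is a minimal sketch. The helper name utf8_code_points is made up for this example, and it assumes the input is already valid UTF-8. It shows that std::string::size() counts bytes, not characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded string.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
// Assumption: the input is valid UTF-8; no validation is done here.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Usage sketch: "naive" with a diaeresis on the i is 5 characters,
// but 6 bytes, because the accented letter takes two bytes in UTF-8.
void demo_utf8_length() {
    const std::string word = "na\xC3\xAFve";
    assert(word.size() == 6);
    assert(utf8_code_points(word) == 5);
}
```

The bytes are written as hex escapes so the example does not depend on the encoding of the source file.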

In a C++ program you have two locales:

  • Standard C library locale set by std::setlocale
  • Standard C++ library locale set by std::locale::global

Unfortunately, neither of them defines the behavior of the standard functions that open files (such as std::fopen or std::fstream::open). Behavior differs between operating systems:

  • Linux is encoding-agnostic, so those functions simply pass the char string to the underlying system call.
  • On Windows, the char string is converted to a wide string using the user-specific locale before the system call is made.

Everything usually works fine on Linux, as nearly everyone uses UTF-8 based locales, so all user input and the arguments passed to main will be UTF-8 encoded. You might still need to switch the current locales to UTF-8 variants explicitly, because by default a C++ program starts in the "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, assume they hold UTF-8 sequences, and everything "just works".
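As a rough sketch of switching away from the default "C" locale, the two locales mentioned above can be set from the environment as shown here. The function names c_locale_name and adopt_environment_locale are invented for this example. Whether the resulting locale is actually a UTF-8 one depends on the user's environment.

```cpp
#include <cassert>
#include <clocale>
#include <locale>
#include <stdexcept>
#include <string>

// Name of the current C library locale. Passing nullptr only queries it.
std::string c_locale_name() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    assert(name != nullptr && "querying the current locale should not fail");
    return name;
}

// Hypothetical helper: tries to switch both the C++ and the C library locale
// to the one named by the environment (LANG, LC_ALL, ...).
// Returns false when the environment names a locale that is not available.
bool adopt_environment_locale() {
    try {
        std::locale::global(std::locale(""));      // C++ library locale
    } catch (const std::runtime_error&) {
        return false;
    }
    return std::setlocale(LC_ALL, "") != nullptr;  // C library locale
}
```

Called at the very start of main, c_locale_name() reports "C". After adopt_environment_locale() it reports whatever the environment selects, for example en_US.UTF-8 on many Linux systems.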

Problems appear when you want to support Windows, as there you always have an additional third locale: the one set for the current user, which can be configured in the Control Panel. The main issue is that this locale is traditionally not a Unicode locale (UTF-8 as the active code page is only an opt-in on recent Windows 10 versions), so you cannot rely on functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you have to use custom wrappers around non-standard, Windows-specific functions such as _wfopen and std::fstream::open(const wchar_t *). You can check Boost.Nowide (part of Boost since version 1.73) to see how this can be done: http://cppcms.com/files/nowide/html/
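A minimal sketch of such a wrapper is shown here. The name fopen_utf8 is hypothetical and this is not how Boost.Nowide is implemented. It only illustrates the idea: convert the UTF-8 path to UTF-16 with MultiByteToWideChar and call _wfopen on Windows, and pass the bytes through unchanged elsewhere.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: opens a file whose path is UTF-8 encoded.
std::FILE* fopen_utf8(const std::string& path, const char* mode) {
    assert(mode != nullptr);
#ifdef _WIN32
    // Convert UTF-8 to UTF-16; an empty result makes _wfopen fail cleanly.
    auto widen = [](const std::string& s) {
        if (s.empty()) return std::wstring();
        const int len = static_cast<int>(s.size());
        const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), len, nullptr, 0);
        std::wstring w(static_cast<std::size_t>(n > 0 ? n : 0), L'\0');
        if (n > 0) {
            MultiByteToWideChar(CP_UTF8, 0, s.data(), len, &w[0], n);
        }
        return w;
    };
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    // Linux and similar systems pass the bytes through to the system call.
    return std::fopen(path.c_str(), mode);
#endif
}
```

The caller closes the returned handle with std::fclose as usual. A null return value means the file could not be opened.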

With C++17 you can use std::filesystem::path to store a file path in a portable way, but it is still broken on Windows:

  • The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC, and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string, but it is too easy to forget about this and use the implicit constructor instead.
  • std::error_category::message(int) for both error categories returns the error description in the user-specific encoding.
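The following sketch shows one way to construct a path from UTF-8 explicitly instead of relying on the implicit constructor. The helper names path_from_utf8 and path_to_utf8 are made up for this example. The code assumes a C++17 or newer compiler, and it branches on __cpp_char8_t because std::filesystem::u8path is deprecated in C++20, where a char8_t string can be passed to the constructor instead.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: builds a path from a UTF-8 encoded std::string
// without going through the locale-dependent implicit constructor.
fs::path path_from_utf8(const std::string& utf8) {
#ifdef __cpp_char8_t
    // C++20 and later: a char8_t sequence is always interpreted as UTF-8.
    return fs::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path is the dedicated factory function.
    return fs::u8path(utf8);
#endif
}

// Hypothetical helper: converts a path back to a UTF-8 encoded std::string.
std::string path_to_utf8(const fs::path& p) {
    // u8string() returns std::string in C++17 and std::u8string in C++20.
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}

// Usage sketch: a UTF-8 file name survives the round trip unchanged.
void demo_utf8_path_round_trip() {
    const std::string name = "caf\xC3\xA9.txt";
    assert(path_to_utf8(path_from_utf8(name)) == name);
}
```

Funnelling every conversion through two such helpers makes it harder to fall back on the implicit constructor by accident.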

So what we have on Windows is:

  • Standard library functions that open files are broken and should never be used.
  • Arguments passed to main(int, char**) are broken and should never be used.
  • WinAPI functions ending in A, and the macros that map to them, are broken and should never be used.
  • std::filesystem::path is partially broken and should never be used directly.
  • Error categories returned by std::generic_category and std::system_category are broken and should never be used.
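For the command-line arguments in particular, one possible workaround is sketched here. The helper name utf8_args is hypothetical. On Windows it ignores the char arguments, re-reads the command line as UTF-16 with GetCommandLineW and CommandLineToArgvW (which requires linking against Shell32), and converts each argument to UTF-8. On other systems it copies argv as-is.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW
#endif

// Hypothetical helper: returns the program arguments as UTF-8 strings.
std::vector<std::string> utf8_args(int argc, char** argv) {
    assert(argc >= 0);
    std::vector<std::string> args;
#ifdef _WIN32
    (void)argv;  // the char arguments are in the user-specific code page
    int wargc = 0;
    LPWSTR* wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv == nullptr) {
        return args;
    }
    for (int i = 0; i < wargc; ++i) {
        // With length -1 the returned size includes the terminating null.
        const int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                          nullptr, 0, nullptr, nullptr);
        std::string s(static_cast<std::size_t>(n > 0 ? n : 0), '\0');
        if (n > 0) {
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                &s[0], n, nullptr, nullptr);
            s.resize(static_cast<std::size_t>(n - 1));  // drop the null
        }
        args.push_back(s);
    }
    LocalFree(wargv);
#else
    for (int i = 0; i < argc; ++i) {
        args.emplace_back(argv[i]);
    }
#endif
    return args;
}
```

A program would call utf8_args(argc, argv) once at the top of main and use the returned vector everywhere else.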

If you need a long-term solution for a non-trivial project, I would recommend:

  • Using Boost.Nowide or implementing similar functionality directly - this fixes broken standard library.
  • Re-implementing standard error categories returned by std::generic_category and std::system_category so that they would always return UTF-8 encoded strings.
  • Wrapping std::filesystem::path so that new class would always use UTF-8 when converting path to string and string to path.
  • Wrapping all required functions from std::filesystem so that they would use your path wrapper and your error categories.

Unfortunately, this won't fix issues with other libraries that work with files, but many of them are broken anyway (they do not support Unicode).

You can check this link for further explanation: http://utf8everywhere.org/

StaceyGirl answered Oct 16 '22 18:10


Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.

Background
The char type has enough capacity to hold any character (encoding) in the ASCII character set.

The issue is that many languages require more encodings than ASCII accounts for. So, instead of 128 possible encodings, more are needed. Some languages have more than 256 possible encodings. A char type does not guarantee a range greater than 256. Thus a new data type is required.

The wchar_t type (a.k.a. the wide character type) provides more room for encodings.

Summary
Use char data type when the range of encodings is 256 or less, such as ASCII. Use wchar_t when you need the capacity for more than 256.

Prefer Unicode to handle large character sets (such as emojis).
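To make the range argument concrete, here is a small sketch. The function names are invented for this example. It checks whether a Unicode code point fits into a single char or a single wchar_t. Note the assumption in the last check: the size of wchar_t is implementation-defined, typically 16 bits on Windows and 32 bits on Linux, so an emoji fits into one wchar_t only on some platforms.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// True when the code point fits into one unsigned char (usually 0..255).
constexpr bool fits_in_one_char(char32_t code_point) {
    return code_point <=
           static_cast<char32_t>(std::numeric_limits<unsigned char>::max());
}

// True when the code point fits into one wchar_t on this platform.
constexpr bool fits_in_one_wchar(char32_t code_point) {
    return code_point <=
           static_cast<char32_t>(std::numeric_limits<wchar_t>::max());
}

// Usage sketch with three code points of increasing size.
void demo_ranges() {
    static_assert(sizeof(char) == 1, "guaranteed by the C++ standard");
    const char32_t letter_a = 0x41;     // U+0041, ASCII
    const char32_t euro_sign = 0x20AC;  // U+20AC, outside 0..255
    const char32_t emoji = 0x1F600;     // U+1F600, needs more than 16 bits
    assert(fits_in_one_char(letter_a));
    assert(!fits_in_one_char(euro_sign));
    assert(fits_in_one_wchar(euro_sign));
    // Platform dependent: only a wchar_t wider than 16 bits can hold it.
    assert(fits_in_one_wchar(emoji) == (sizeof(wchar_t) * CHAR_BIT > 16));
}
```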

Thomas Matthews answered Oct 16 '22 17:10