 

What exactly can wchar_t represent?


According to cppreference.com's doc on wchar_t:

wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units) It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.

The Standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.

So, if I want to deal with Unicode characters, should I use wchar_t?

Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?

asked May 18 '18 by YSC

People also ask

What is the meaning of wchar_t?

The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.

Why is wchar_t used?

wchar_t is used when you need to store characters whose codes are above 255, because such characters don't fit in the 8-bit character type 'char' and therefore require more memory. It generally has a size greater than an 8-bit character. The Windows operating system uses it substantially.

Is wchar_t a UTF 16?

And wchar_t is UTF-16 on Windows, so on Windows the conversion function can just do a memcpy. On everything else, the conversion is algorithmic, and pretty simple.


2 Answers

So, if I want to deal with Unicode characters, should I use wchar_t?

First of all, note that the encoding does not force you to use any particular type to represent a character. You may use char to represent Unicode characters just as wchar_t can; you only have to remember that up to 4 chars together will form a valid code point, depending on whether the encoding is UTF-8, UTF-16, or UTF-32, while a wchar_t can hold a code point on its own (UTF-32 on Linux, etc.) or needs up to 2 units working together (UTF-16 on Windows).
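As a small sketch of that difference (the character 'é' is just an example; the byte values assume UTF-8, and the exact sizes depend on your platform):

#include <iostream>
#include <string>

int main() {
    std::string utf8 = "\xC3\xA9";   // the single code point U+00E9 ('é') as two UTF-8 chars
    std::wstring wide = L"\u00E9";   // the same code point as one wchar_t (on both Linux and Windows)
    std::cout << utf8.size() << ' ' << wide.size() << '\n';   // prints "2 1"
}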

Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width for representing code points (like UTF-32), others (such as UTF-8 and UTF-16) have variable lengths (in UTF-8, the letter 'a' only uses up 1 byte, while characters outside the basic Latin range use up more bytes for their representation).

So you have to decide what kind of characters you want to represent and then choose your encoding accordingly, because this choice affects how many bytes your data will take. E.g. using UTF-32 to represent mostly English characters will lead to many 0-bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages.

Once you have decided on this, you should minimize the number of conversions and stay consistent with your decision.

In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).

If you would like to do text manipulation/interpretation on a code-point basis, char certainly is not the way to go if you have e.g. Japanese kanji. But if you just want to communicate your data and regard it as no more than a sequence of bytes, you may just go with char.
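As a rough illustration of that (a sketch; the kanji string is just an example), a std::string only ever sees bytes, so its size() does not tell you how many characters the text contains:

#include <iostream>
#include <string>

int main() {
    std::string s = "\xE6\xBC\xA2\xE5\xAD\x97";   // "漢字": 2 code points encoded as 6 UTF-8 bytes
    std::cout << s.size() << '\n';                // prints 6 (bytes), not 2 (characters)
}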

The link to UTF-8 Everywhere was already posted as a comment, and I suggest having a look there as well. Another good read is What every programmer should know about encodings.

As of now, there is only rudimentary language support in C++ for Unicode (like the char16_t and char32_t data types, and the u8/u/U literal prefixes). So choosing a library for managing encodings (especially conversions) certainly is good advice.
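A short sketch of those built-in pieces (assuming a C++11 compiler and UTF-8 source encoding; note that in C++20 the u8 prefix yields char8_t instead of char):

#include <string>

int main() {
    const char*    a = u8"π";   // UTF-8 encoded literal (type changes to char8_t in C++20)
    std::u16string b = u"π";    // UTF-16 code units stored in char16_t
    std::u32string c = U"π";    // UTF-32 code units stored in char32_t, one per code point
    char32_t pi = U'π';         // a single code point as a char32_t value
    (void)a; (void)b; (void)c; (void)pi;
}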

answered Sep 20 '22 by Jodocus


wchar_t is used in Windows, which uses the UTF16-LE format. wchar_t requires wide-char functions: for example wcslen(const wchar_t*) instead of strlen(const char*), and std::wstring instead of std::string.
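For instance, a minimal sketch of the wide-character counterparts of the usual calls (standard C++ only, nothing Windows-specific):

#include <cwchar>
#include <iostream>
#include <string>

int main() {
    const wchar_t* w = L"hello";
    std::wcout << std::wcslen(w) << L'\n';   // wcslen instead of strlen; prints 5

    std::wstring ws = L"wide string";        // std::wstring instead of std::string
    std::wcout << ws.size() << L'\n';        // prints 11
}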

Unix-based machines (Linux, Mac, etc.) use UTF8. This uses char for storage, and the same C and C++ functions as for ASCII, such as strlen(const char*) and std::string (see comments below about std::find_first_of).

wchar_t is 2 bytes (UTF16) in Windows, but on other machines it is 4 bytes (UTF32). This makes things more confusing.
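You can check which case you are in at compile time; a tiny sketch:

#include <cstdio>

int main() {
    // 2 on Windows (UTF-16 code units), 4 on most Unix-like systems (UTF-32)
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    static_assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4,
                  "unexpected wchar_t size");
}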

For UTF32, you can use std::u32string, which is the same on different systems.


You might consider converting UTF8 to UTF32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.

UTF8 is designed so that the ASCII byte values 0 to 127 are never used to represent other Unicode code points. That includes the escape character '\', printf format specifiers, and common parsing characters like ','.

Consider the following UTF8 string. Let's say you want to find the comma:

std::string str = u8"汉,🙂"; //3 code points represented by 8 bytes 

The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','.

To find 汉, you can search for the string u8"汉", since this code point cannot be represented as a single char.
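A small sketch of both searches on the string above (assuming pre-C++20 semantics for u8 literals, where they are plain char arrays):

#include <iostream>
#include <string>

int main() {
    std::string str = u8"汉,🙂";              // 8 bytes: 3 + 1 + 4

    std::cout << str.find(',') << '\n';       // prints 3: the comma is a single byte
    std::cout << str.find(u8"汉") << '\n';    // prints 0: substring search over the 3-byte sequence
}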

Some C and C++ functions don't work smoothly with UTF8. These include

strtok
strspn
std::find_first_of

The argument to the above functions is a set of characters, not an actual string.

So str.find_first_of(u8"汉") does not work, because u8"汉" is 3 bytes and find_first_of will look for any one of those bytes. There is a chance that one of those bytes is used to represent a different code point.

On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character).
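A short sketch contrasting the two calls (same str as above; again assuming pre-C++20 u8 literals):

#include <iostream>
#include <string>

int main() {
    std::string str = u8"汉,🙂";

    // Unsafe: this searches for any one of the 3 bytes of "汉", so it can
    // match a byte that belongs to a completely different code point.
    std::cout << str.find_first_of(u8"汉") << '\n';      // happens to print 0 here

    // Safe: every character in the set is ASCII, so a match is always a real one.
    std::cout << str.find_first_of(u8",;abcd") << '\n';  // prints 3 (the comma)
}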

In rare cases UTF32 might be required (although I can't imagine where!). You can use std::codecvt to convert UTF8 to UTF32 and run the following operations:

std::u32string u32 = U"012汉"; //4 code points, represented by 4 elements
cout << u32.find_first_of(U"汉") << endl; //outputs 3
cout << u32.find_first_of(U'汉') << endl; //outputs 3
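As a sketch of that conversion step (std::wstring_convert and std::codecvt_utf8 are deprecated since C++17 but still available; this also assumes pre-C++20 u8 literals):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes(u8"012汉");   // UTF-8 -> UTF-32

    std::cout << u32.size() << '\n';                   // prints 4 (code points)
    std::cout << u32.find_first_of(U'汉') << '\n';     // prints 3
}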

Side note:

You should use "Unicode everywhere", not "UTF8 everywhere".

In Linux, Mac, etc. use UTF8 for Unicode.

In Windows, use UTF16 for Unicode. Windows programmers use UTF16; they don't make pointless conversions back and forth to UTF8. But there are legitimate cases for using UTF8 in Windows.

Windows programmers tend to use UTF8 for saving files, web pages, etc., so that's one less compatibility worry for non-Windows programmers.

The language itself doesn't care which Unicode format you want to use, but in terms of practicality use a format that matches the system you are working on.

answered Sep 20 '22 by Barmak Shemirani