
Why do we convert from MultiByte to WideChar?

I am used to dealing with ASCII strings, but now with UNICODE I am quite confused about some terms:

  • What is a multi-byte character and what is a widechar? What's the difference? Does multi-byte refer to a character that takes up more than one byte in memory, while widechar is just a data type to represent it?

  • Why do we convert with MultiByteToWideChar and WideCharToMultiByte?

If I declare something like this:

wchar_t* wcMsg = L"مرحبا";
MessageBoxW(0, wcMsg, 0, 0);

It prints the message correctly if I defined UNICODE. But why didn't I have to convert here with WideCharToMultiByte?

  • What is the difference between the character sets in my project: _MBCS and UNICODE?

  • One last thing that confuses me: MSDN says the "Windows APIs" are UTF-16.

Can anyone explain with some examples? A good clarification is really appreciated.

asked Nov 11 '17 by WonFeiHong


1 Answer

An ASCII string has a char width of one byte (usually 8 bits; rarely 7, 9, or other bit widths). This is a legacy of a time when memory was very small and expensive, and when processors could often handle only one byte per instruction.

As you can easily imagine, one byte is by far not enough to store all the glyphs available in the world. Chinese alone has about 87,000 glyphs. A char can usually hold only 256 glyphs (in an 8-bit byte). ASCII defines only 96 glyphs (plus the lower 32 chars, which are defined as non-printable control chars), which makes it a 7-bit charset. This is enough for English upper- and lower-case letters, numbers, and some punctuation and other glyphs. The highest bit in the common 8-bit byte is not used by ASCII.
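For illustration, here is a tiny C++ check for that unused high bit (the function name IsPureAscii is made up): a byte belongs to 7-bit ASCII exactly when its top bit is clear.

#include <cstdio>

// Returns true if every byte in the string fits in 7-bit ASCII,
// i.e. the highest bit of each byte is 0.
static bool IsPureAscii(const char* s)
{
    for (; *s != '\0'; ++s)
        if (static_cast<unsigned char>(*s) & 0x80)
            return false;   // high bit set: not 7-bit ASCII
    return true;
}

int main()
{
    printf("%d\n", IsPureAscii("Hello"));    // 1 - plain ASCII
    printf("%d\n", IsPureAscii("\xD9\x85")); // 0 - UTF-8 bytes of an Arabic letter
}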

To handle more glyphs than one byte can hold, one approach is to store the fundamental glyphs in one byte, other common glyphs in two bytes, and rarely used glyphs in three or even more bytes. This approach is called a multi-byte charset or variable-width encoding. A very common example is UTF-8, which uses one to four bytes per character. It stores the ASCII charset in one byte (so it is also backward compatible with ASCII). The highest bit is defined as a switch: if it is set, other bytes will follow. The same applies to the following bytes, so that a "chain" of up to four bytes is formed. The pros of a variable-width charset are:

  • Backward compatibility with the 7-bit ASCII charset
  • Memory friendly - uses as little memory as possible

The downside is:

  • More difficult and processor-expensive to handle. You cannot simply iterate over a string and assume that each myString[n] delivers one glyph; instead, you must evaluate each byte to see whether more bytes follow, as the sketch below shows.
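To make that concrete, here is a minimal C++ sketch of such an evaluation for UTF-8 (the helper name Utf8Length is made up; it assumes valid UTF-8 input): continuation bytes always have the form 10xxxxxx, so counting the bytes that don't match that pattern counts the glyphs.

#include <cstddef>

// Counts code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes valid UTF-8 input.
static std::size_t Utf8Length(const char* s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;   // a lead byte (or plain ASCII byte) starts a new glyph
    return count;
}

// Utf8Length("abc") is 3, but for a 2-byte Arabic letter it is 1,
// even though strlen would report 2.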

Another approach is to store each character in a fixed-length word made out of n bytes, which is wide enough to hold all possible glyphs. This is called a fixed-width charset; all chars have the same width. A well-known example is UTF-32. It is 32 bits wide and can store all possible characters in one word. The pros and cons of a fixed-width charset are obviously the opposite of a variable-width charset: memory-heavy but easier to iterate.
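A small C++11 sketch of this (the string content is just an example): with UTF-32 literals, indexing really does deliver exactly one code point per element.

#include <string>
#include <cassert>

int main()
{
    // U"..." produces a UTF-32 string: one 32-bit element per code point.
    std::u32string s = U"مرحبا";
    assert(s.size() == 5);   // five Arabic letters, five elements
    char32_t first = s[0];   // s[n] is always exactly one code point
    (void)first;
}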

But Microsoft chose their native charset even before UTF-32 was available: they use UTF-16 as the charset of Windows, which uses a word length of at least 2 bytes (16 bits). This is large enough to store a lot more glyphs than a single-byte charset, but not all of them. Considering this, Microsoft's differentiation between "multi-byte" and "Unicode" is a bit misleading today, because their Unicode implementation is also a multi-byte charset - just one with a bigger minimum size per glyph. Some say that's a good compromise, some say it's the worst of both worlds - anyway, that's the way it is. And at that time (Windows NT) it was the only available Unicode charset, and from this perspective, their distinction between multi-byte and Unicode was correct at that time (see Raymond Chen's comment).
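You can observe this multi-byte behavior of UTF-16 with any character outside the Basic Multilingual Plane; a minimal C++11 sketch (the chosen emoji is arbitrary):

#include <string>
#include <cassert>

int main()
{
    // U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
    std::u16string s = u"\U0001F600";
    assert(s.size() == 2);   // one glyph, two 16-bit code units
}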

Of course, if you want to transfer a string in one encoding (let's say UTF-8) into another one (let's say UTF-16), you have to convert it. That's what MultiByteToWideChar does for you, and WideCharToMultiByte does the reverse. There are some other conversion functions and libraries as well.
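A typical call of MultiByteToWideChar follows a two-step pattern: first query the required buffer size, then do the actual conversion. A minimal sketch, assuming UTF-8 input (the helper name Utf8ToUtf16 is made up, and error handling is kept to the bare minimum):

#include <windows.h>
#include <string>

// Converts a UTF-8 encoded std::string to a UTF-16 std::wstring.
static std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call: pass 0 as the output size to query the needed length
    // (the -1 tells the API the input is null-terminated).
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len == 0) return std::wstring();   // conversion failed
    std::wstring utf16(len, L'\0');
    // Second call: perform the conversion into the allocated buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &utf16[0], len);
    utf16.resize(len - 1);   // drop the terminating L'\0' included in len
    return utf16;
}

WideCharToMultiByte works the same way in the opposite direction, with CP_UTF8 as the target code page.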

This conversion costs quite a bit of time, so the conclusion is: if you make heavy use of strings and system calls, for the sake of performance you should use the native charset of your operating system, which in your case is UTF-16.

So for your string handling you should choose wchar_t, which on Windows means UTF-16. Unfortunately, the width of wchar_t varies from compiler to compiler; under Unix it is usually UTF-32 (4 bytes), under Windows it is UTF-16 (2 bytes).
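You can verify this on your own platform with a one-liner:

#include <cstdio>

int main()
{
    // Typically prints 2 on Windows (UTF-16) and 4 on Unix-like systems (UTF-32).
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}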

_MBCS is an automatic preprocessor define which tells you that your character set is defined as multi-byte; UNICODE tells you that you have set it to UTF-16.
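A sketch of how code can branch on these defines (this mirrors the pattern the Windows headers use internally for the A/W function pairs):

#include <windows.h>

void ShowCharset()
{
#ifdef UNICODE
    // "Use Unicode Character Set": wide-char (UTF-16) build
    MessageBoxW(0, L"UNICODE build", 0, 0);
#elif defined(_MBCS)
    // "Use Multi-Byte Character Set": char-based build
    MessageBoxA(0, "_MBCS build", 0, 0);
#endif
}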

You can write

wchar_t* wcMsg = L"مرحبا";
MessageBoxW(0, wcMsg, 0, 0);

even in a program which doesn't have the UNICODE define set. The L prefix defines that your string is a UNICODE (wide-char) string, and you can call system functions with it.

Unfortunately, you cannot write

char* msg = u8"مرحبا";
MessageBoxA(0, msg, 0, 0);

Character set support was improved in C++11, so you can also define a string as UTF-8 with the u8 prefix. But the Windows functions with the "A" postfix don't understand UTF-8, at least until Windows 10 build 17035 (see tambre's comment; see also https://stackoverflow.com/a/504789/2328447). This also suggests using UTF-16 aka UNICODE under Windows/Visual Studio.
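If you do have UTF-8 data, the usual workaround is to convert it first and then call the W function, e.g. with the hypothetical Utf8ToUtf16 helper sketched above (this assumes a pre-C++20 compiler, where u8 literals are plain char):

// Convert the UTF-8 literal to UTF-16, then use the wide-char API.
std::string msg = u8"مرحبا";
MessageBoxW(0, Utf8ToUtf16(msg).c_str(), 0, 0);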

Setting your project to "Use Multi-Byte Character Set" or "Use Unicode Character Set" also changes a lot of other character-dependent defines. The most common ones are the macros TCHAR, _T(), and all character-dependent Windows functions without a postfix, e.g. MessageBox() (without the W or A postfix):

  • If you set your project to "Use Multi-Byte Character Set", TCHAR expands to char, _T() expands to nothing, and the Windows functions get the A postfix attached.
  • If you set your project to "Use Unicode Character Set", TCHAR expands to wchar_t, _T() expands to the L prefix, and the Windows functions get the W postfix attached.

This means that writing

TCHAR* msg = _T("Hello");
MessageBox(0, msg, 0, 0);

will compile with either the multi-byte charset or the Unicode charset selected. You can find some comprehensive guides about these topics on MSDN.

Unfortunately

TCHAR* msg = _T("مرحبا");
MessageBox(0, msg, 0, 0);

still won't work when "Use Multi-Byte Character Set" is selected - the Windows functions still don't support UTF-8, and you will even get some compiler warnings, because you have used Unicode characters in a string that is not marked as Unicode (_T() does not expand to u8).

answered Oct 15 '22 by user2328447