
How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

That is:

  • Many programmers seem to feel that UTF-16 is harmful because it is a variable-length code.
  • wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/macOS.
  • The Windows APIs use wide characters, not Unicode.

So what does Windows do when you want to encode something like 𠂊 (U+2008A), a Han character?

vy32 asked Oct 23 '11 23:10



2 Answers

The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.

So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher-level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
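To make the "code units, not characters" point concrete, here is a minimal sketch of how a code point outside the BMP becomes two 16-bit units. It uses `char16_t` and `std::u16string` as a portable stand-in for the 16-bit Windows `wchar_t` (an assumption made so the example behaves the same on Unix, where `wchar_t` is 32 bits). The function name `encode_utf16` is hypothetical, not a Windows or standard library API.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// char16_t stands in for the 16-bit wchar_t used on Windows.
// Sketch only: the input is assumed to be a valid scalar value
// (not a surrogate, not above U+10FFFF); no validation is done.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // BMP code point: fits in a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: high surrogate, then low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

For the character in the question, `encode_utf16(0x2008A)` yields the two units 0xD840 0xDC8A, so the string's `size()` is 2 even though it holds one character. A length function that counts code units reports it the same way.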

Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can call a file 𐐀.txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).

But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact you can create a file called 𐐀.txt in the same folder as 𐐨.txt, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
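The reversed pair in that last file name is a good test case for the difference between "a sequence of 16-bit units" and "well-formed UTF-16". The following sketch checks well-formedness by hand; `is_well_formed_utf16` and the two helpers are hypothetical names, and `char16_t` is again assumed as a stand-in for the Windows `wchar_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// True if every high surrogate is immediately followed by a low surrogate
// and no low surrogate appears without a preceding high surrogate.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 >= s.size() || !is_low_surrogate(s[i + 1])) {
                return false;  // high surrogate with no partner
            }
            ++i;  // skip the low surrogate that completes the pair
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate on its own
        }
    }
    return true;
}
```

The string type stores both `{0xD801, 0xDC00}` (𐐀, U+10400) and `{0xDC01, 0xD801}` without complaint, but only the first passes this check. That is the obliviousness described above: validity is something a higher layer has to decide.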

This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.

bobince answered Sep 28 '22 08:09



Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.

Not all third-party programs handle this correctly, so some are buggy with data outside the BMP.

Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type so that they can handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
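A rough illustration of why a wider type is needed: a function whose parameter or return value is one 16-bit unit only ever sees half of a surrogate pair, so any per-character interface has to work in something at least 21 bits wide. The sketch below decodes UTF-16 into `char32_t` code points. `decode_utf16` is a hypothetical helper, not the Windows function mentioned above, and `char16_t` is assumed as a stand-in for the Windows `wchar_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-16 code units into code points. The element type of the
// result is char32_t because a supplementary character cannot fit in
// a single 16-bit unit. Unpaired surrogates are passed through
// unchanged; that is a choice made for this sketch, not a standard rule.
std::u32string decode_utf16(const std::u16string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            char32_t low = s[i + 1];
            // Combine the 10 payload bits of each surrogate.
            u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        out.push_back(u);
    }
    return out;
}
```

Decoding the two units 0xD840 0xDC8A gives a single element, 0x2008A, which is larger than any value a 16-bit type can hold.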

bames53 answered Sep 28 '22 09:09
