Why did the creators of Windows and Linux choose different ways to support Unicode?

As far as I know, Linux chose UTF-8 for its backward compatibility with ASCII, whereas Windows added a completely new set of API functions for UTF-16 (the ones ending in "W"). Could these decisions have been different? Which one has proved better?

Asked May 28 '10 by Michal Czardybon

1 Answer

UTF-16 is pretty much a loss, the worst of both worlds. It is neither compact (for the typical case of ASCII characters), nor does it map each code unit to a character: a code point outside the Basic Multilingual Plane takes two code units, a surrogate pair. This hasn't really bitten anyone too badly yet since such characters are still rarely used, but it sure is ugly.
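To make that concrete, here is a minimal C sketch (my illustration, not anything from the Windows or POSIX APIs) of how a code point above U+FFFF gets split into a surrogate pair, so one character becomes two UTF-16 code units:

```c
#include <stdio.h>
#include <stdint.h>

/* Encode one code point as UTF-16 code units; returns the count (1 or 2).
   A code point above U+FFFF cannot fit in one 16-bit unit and needs
   a surrogate pair. */
static int utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {              /* BMP: one code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                   /* remaining 20 bits, split in two */
    out[0] = 0xD800 | (cp >> 10);    /* high surrogate */
    out[1] = 0xDC00 | (cp & 0x3FF);  /* low surrogate */
    return 2;
}

int main(void) {
    uint16_t units[2];
    int n = utf16_encode(0x1F600, units);  /* U+1F600, outside the BMP */
    printf("%d code units: 0x%04X 0x%04X\n", n, units[0], units[1]);
    /* prints: 2 code units: 0xD83D 0xDE00 -- one character, two units */
    return 0;
}
```

So any code that assumes "one code unit = one character" silently breaks on exactly these characters.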

POSIX (Linux et al.) has some wide-character ("w") APIs too, based on the wchar_t type. On platforms other than Windows, wchar_t typically corresponds to UTF-32 rather than UTF-16, which is nice for easy string manipulation but incredibly bloated in memory.
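You can see the difference directly by checking the size of wchar_t on each platform; a minimal sketch, assuming glibc on Linux and MSVC on Windows:

```c
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* On Linux/glibc this typically prints 4 (UTF-32 code units);
       compiled with MSVC on Windows it prints 2 (UTF-16 code units). */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}
```

This is also why wchar_t is useless for portable text handling: the same "wide string" type means different encodings on different systems.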

But in-memory APIs aren't really that important. What causes much more difficulty is file storage and on-the-wire protocols, where data is exchanged between applications with different charset traditions.

Here, compactness beats ease-of-indexing; UTF-8 has clearly proven to be the best format for this by far, and Windows's poor support of UTF-8 causes real difficulties. Windows is the last modern operating system to still have locale-specific default encodings; everyone else has moved to UTF-8 by default.
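As an illustration of the friction, here is a minimal sketch of the conversion boilerplate a UTF-8-based program typically needs at every Windows API boundary, using the Win32 MultiByteToWideChar function with CP_UTF8 (the string literal is just example data):

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *utf8 = "na\xC3\xAFve";  /* "naïve" as UTF-8 bytes */

    /* Common Win32 idiom: ask for the required buffer size first,
       then convert; -1 means the input is NUL-terminated. */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (n == 0) return 1;
    wchar_t *wide = malloc(n * sizeof(wchar_t));
    if (!wide) return 1;
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);

    /* 'wide' can now be passed to any "W" API, e.g. CreateFileW. */
    printf("%d UTF-16 code units (incl. NUL)\n", n);
    free(wide);
    return 0;
}
```

Every file name, window title, and registry key that arrives as UTF-8 has to go through a round trip like this, in both directions.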

Whilst I seriously hope Microsoft will reconsider this for future versions, as it causes huge and unnecessary problems even within the Windows-only world, it's understandable how it happened.

The thinking back in the old days when WinNT was being designed was that UCS-2 was it for Unicode. There wasn't going to be anything outside the 16-bit character range. Everyone would use UCS-2 in memory, and naturally it would be easiest to save this content directly from memory. This is why Windows called that format "Unicode", and to this day still calls UTF-16LE just "Unicode" in UI like save dialogs, despite that being totally misleading.

UTF-8 wasn't even standardised until Unicode 2.0 (along with the extended character range and the surrogates that made UTF-16 what it is today). By then Microsoft were on to WinNT4, at which point it was far too late to change strategy. In short, Microsoft were unlucky to be designing a new OS from scratch around the time when Unicode was in its infancy.

Answered Nov 15 '22 by bobince