 

Why isn't UTF-8 allowed as the "ANSI" code page?

The Windows _setmbcp function allows any valid code page...

(except UTF-7 and UTF-8, which are not supported)
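For reference, a minimal sketch of what that restriction looks like in practice (assuming a Windows build with the CRT headers; code page 932, Shift-JIS, is used only as an example of a multi-byte code page that is accepted):

    #include <windows.h>   /* CP_UTF8 (65001) */
    #include <mbctype.h>   /* _setmbcp */
    #include <stdio.h>

    int main(void)
    {
        /* An ordinary multi-byte code page such as Shift-JIS is accepted
           (assuming it is installed on the system)... */
        if (_setmbcp(932) == 0)
            printf("_setmbcp(932) succeeded\n");

        /* ...but UTF-8 is rejected: per the documentation quoted above,
           _setmbcp returns -1 for an unsupported code page. */
        if (_setmbcp(CP_UTF8) == -1)
            printf("_setmbcp(CP_UTF8) failed, as documented\n");

        return 0;
    }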

OK, not supporting UTF-7 makes sense: Characters have non-unique representations and that introduces complexity and security risks.

But why not UTF-8?

As I understand it, the "ANSI" versions of the Windows API functions convert their arguments to UTF-16, call the equivalent "W" function, and convert any strings in the output to "ANSI". This is what I've been doing manually. So why can't Windows do it for me?
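For example, the manual round-trip described above looks roughly like the sketch below (MessageBoxW is used purely as a stand-in for any "W" function, and the input is assumed to be a NUL-terminated UTF-8 string):

    #include <windows.h>
    #include <stdlib.h>

    void show_utf8_message(const char *utf8)
    {
        /* First call: ask how many UTF-16 code units are needed
           (cbMultiByte == -1 means the string is NUL-terminated, and the
           returned length includes the terminator). */
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (len <= 0)
            return;

        wchar_t *wide = malloc(len * sizeof(wchar_t));
        if (!wide)
            return;

        /* Second call: perform the actual UTF-8 -> UTF-16 conversion. */
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);

        /* Call the wide-character ("W") API directly; for APIs that return
           text, WideCharToMultiByte would convert the result back. */
        MessageBoxW(NULL, wide, L"UTF-8 via UTF-16", MB_OK);

        free(wide);
    }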

asked Jun 08 '10 by dan04


People also ask

Is UTF-8 the same as ANSI?

No. Both are character encodings, but an "ANSI" code page is a single-byte format that covers only a small repertoire (typically a Latin alphabet plus a few symbols), whereas UTF-8 is a variable-length Unicode encoding (1 to 4 bytes per character) that can encode every Unicode character.

Is UTF-8 A superset of ANSI?

Not exactly. UTF-8 is a superset of ASCII (the first 128 code points), not of any particular ANSI code page: bytes 0x80-0xFF mean different things in, say, Windows-1252 than they do in UTF-8. In practice, though, UTF-8 has all but replaced the legacy ANSI code pages as the encoding of choice because it avoids their limitations.

Is UTF-8 a code page?

UTF-8 is the universal code page for internationalization and is able to encode the entire Unicode character set. It is used pervasively on the web, and is the default for *nix-based platforms.

Does ANSI support Unicode?

No. An ANSI code page can represent only its own small set of characters; only the first 128 code points coincide with ASCII and therefore with UTF-8. Older operating systems such as Windows 95 also cannot work with Unicode natively, so programs that rely on Unicode do not run properly on them.


1 Answer

The "ANSI" codepage is basically legacy: Windows 9X era. All modern software should be Unicode (that is, UTF-16) based anyway.

Basically, when the ANSI code page machinery was originally designed, UTF-8 hadn't even been invented, so support for multi-byte encodings was rather haphazard: most ANSI code pages are single-byte, with the exception of some East Asian code pages in which characters are one or two bytes. Adding support for "proper" multi-byte encodings like UTF-8 was probably deemed not worth the effort when all new development should be done in UTF-16 anyway.
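To make those byte widths concrete, here is a small sketch that uses GetCPInfoExW to query the maximum character size of a single-byte code page, a double-byte East Asian code page, and UTF-8 (1252 and 932 are just example code pages):

    #include <windows.h>
    #include <stdio.h>

    /* Print how many bytes a single character may occupy in a code page. */
    static void report(UINT cp)
    {
        CPINFOEXW info;
        if (GetCPInfoExW(cp, 0, &info))
            wprintf(L"Code page %u (%ls): up to %u byte(s) per character\n",
                    cp, info.CodePageName, info.MaxCharSize);
    }

    int main(void)
    {
        report(1252);     /* Windows-1252: single byte             */
        report(932);      /* Shift-JIS: one or two bytes           */
        report(CP_UTF8);  /* UTF-8: up to four bytes per character */
        return 0;
    }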

answered by Dean Harding