What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?

There are different encodings of the same (standardized) Unicode table. For example, in UTF-8 encoding A corresponds to 0x0041, but in UTF-16 encoding the same A is represented as 0xfeff0041.

From this brilliant article I have learned that when I program in C++ for the Windows platform and deal with Unicode, I should know that it is represented in 2 bytes. But the article says nothing about the encoding. (It even says that x86 CPUs are little-endian, so I know how those two bytes are stored in memory.) But I should also know the encoding of the Unicode so that I have complete information about how the symbols are stored in memory. Is there any fixed Unicode encoding for C++/Windows programmers?

asked Nov 21 '12 by Narek

1 Answer

The values stored in memory for Windows are UTF-16 little-endian, always. But that's not what you're talking about - you're looking at file contents. Windows itself does not specify the encoding of files; it leaves that to individual applications.
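As a small illustration of that in-memory representation (a minimal sketch, assuming a Windows build; the string literal and buffer size are arbitrary): the wide-character "W" APIs all take UTF-16 code units, so text in any other encoding is converted first, e.g. with MultiByteToWideChar.

    #include <windows.h>

    int main() {
        // Windows "Unicode" data types (WCHAR / wchar_t) hold 16-bit UTF-16 code units.
        static_assert(sizeof(wchar_t) == 2, "wchar_t is 2 bytes on Windows");

        // Text arriving as UTF-8 (from a file, the network, ...) has to be
        // converted to UTF-16 before it is handed to a wide ("W") API.
        const char utf8[] = "A\xC3\xA9";                  // "Ae-acute" encoded as UTF-8
        wchar_t utf16[16] = {};
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, 16);

        // utf16 now contains 0x0041, 0x00E9, 0x0000, stored little-endian in memory.
        MessageBoxW(nullptr, utf16, L"UTF-16 in memory", MB_OK);
        return 0;
    }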

The 0xfe 0xff you see at the start of the file is a Byte Order Mark, or BOM. It not only indicates that the file is most probably Unicode, it also tells you which variant of Unicode encoding was used.

0xfe 0xff      UTF-16 big-endian
0xff 0xfe      UTF-16 little-endian
0xef 0xbb 0xbf UTF-8

A file that doesn't have a BOM should be assumed to contain 8-bit characters unless you know how it was written. That still doesn't tell you whether it's UTF-8 or some other Windows character encoding; you'll just have to guess.
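A minimal sketch of that BOM check (the file name "input.txt" is just a placeholder; anything without a recognized BOM falls through to the guessing case described above):

    #include <cstddef>
    #include <cstdio>

    // Rough guess at the encoding based on the first bytes of the file.
    const char* guess_encoding(const unsigned char* b, std::size_t n) {
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 (with BOM)";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)                 return "UTF-16 big-endian";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)                 return "UTF-16 little-endian";
        return "no BOM: 8-bit characters, exact encoding unknown";
    }

    int main() {
        unsigned char buf[3] = {};
        std::FILE* f = std::fopen("input.txt", "rb");        // placeholder file name
        std::size_t n = f ? std::fread(buf, 1, sizeof buf, f) : 0;
        if (f) std::fclose(f);
        std::printf("%s\n", guess_encoding(buf, n));
        return 0;
    }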

You may use Notepad as an example of how this is done. If the file has a BOM then Notepad will read it and process the contents appropriately. Otherwise you must specify the encoding yourself with the "Encoding" dropdown list.

Edit: the reason Windows documentation isn't more specific about the encoding is that Windows was a very early adopter of Unicode, and at the time there was only one encoding of 16 bits per code point. When 65536 code points were determined to be inadequate, surrogate pairs were invented as a way to extend the range and UTF-16 was born. Microsoft was already using Unicode to refer to their encoding and never changed.
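To see the surrogate-pair mechanism in action (a minimal sketch; U+1F600 is just an arbitrary example of a code point above U+FFFF):

    #include <cstdio>

    int main() {
        // U+1F600 is outside the original 16-bit range, so UTF-16 encodes it as a
        // surrogate pair: high surrogate 0xD83D followed by low surrogate 0xDE00.
        const char16_t smiley[] = u"\U0001F600";

        std::printf("code units: %u\n", (unsigned)(sizeof smiley / sizeof smiley[0] - 1)); // 2
        std::printf("0x%04X 0x%04X\n", (unsigned)smiley[0], (unsigned)smiley[1]);          // 0xD83D 0xDE00
        return 0;
    }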

answered Oct 24 '22 by Mark Ransom