
Size of wchar_t* for surrogate pair (Unicode character out of BMP) on Windows

I have encountered an interesting issue on Windows 8. I tested whether I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced results I did not expect:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2); // i3 == 4, why?

U+2008A is a Han character outside the Basic Multilingual Plane, so it should be represented by a surrogate pair in UTF-16. That means, if I understand it correctly, that it should be represented by two wchar_t characters. So I expected sizeof(s2) to be 6 (4 for the two wchar_t values of the surrogate pair and 2 for the terminating \0).

So why is sizeof(s2) == 4? I verified that the s2 string was constructed correctly by rendering it with DirectWrite: the Han character was displayed correctly.

UPDATE: As Naveen pointed out, I was determining the size of the strings incorrectly. The following code produces the correct result:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

std::wstring str1 (s1);
std::wstring str2 (s2);

int i2 = str1.size(); // i2 == 1.
int i3 = str2.size(); // i3 == 2, because two wchar_t characters are needed for the surrogate pair.
Mark Vincze asked Jul 16 '12 12:07


People also ask

How many bytes is wchar_t?

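The size of wchar_t is implementation-defined: it is 2 bytes on Windows and typically 4 bytes on Linux and macOS. Note that on AIX a wchar_t is 2 bytes in 32-bit mode and 4 bytes in 64-bit mode.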

What is surrogate pair in Unicode?

Code points outside the BMP are encoded in UTF-16 as two 16-bit code units taken from two reserved ranges: the first (the high surrogate) is in the range 0xD800-0xDBFF and the second (the low surrogate) is in the range 0xDC00-0xDFFF. This is called a surrogate pair.
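
The split into two ranges can be sketched in a few lines. This is only an illustration: SurrogatePair and to_surrogates are hypothetical names that do not come from the question, and char16_t/char32_t are used instead of wchar_t so the result does not depend on the platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type and helper for illustration; not part of the original post.
struct SurrogatePair {
    char16_t high; // always in 0xD800..0xDBFF
    char16_t low;  // always in 0xDC00..0xDFFF
};

// Splits a code point outside the BMP (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
inline SurrogatePair to_surrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    // Subtracting 0x10000 leaves a 20-bit value.
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;
    SurrogatePair p;
    p.high = static_cast<char16_t>(0xD800u + (v >> 10));    // top 10 bits
    p.low  = static_cast<char16_t>(0xDC00u + (v & 0x3FFu)); // bottom 10 bits
    return p;
}
```

For the character from the question, to_surrogates(0x2008A) gives 0xD840 followed by 0xDC8A.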

What encoding is wchar_t?

The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.

What is the difference between wchar_t and char?

char is used by the so-called ANSI family of Windows API functions (function names typically end with A), which work with the system's ANSI code page rather than plain ASCII. wchar_t is used by the newer so-called Unicode (or wide) family of functions (function names typically end with W), which use the UTF-16 encoding.


2 Answers

sizeof(s2) yields the number of bytes required to store the pointer s2 itself, which is 4 bytes on your system, the same as for any other data pointer. It has nothing to do with the character(s) stored in the string that s2 points to.
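
A minimal sketch of the difference, assuming char16_t in place of wchar_t so that the code units are UTF-16 on every platform (han_array, han_ptr and unit_count are hypothetical names, not from the question). sizeof applied to an array does include the characters and the terminator, which is where the expected 6 bytes would come from on a platform with 2-byte code units; sizeof applied to a pointer never does.

```cpp
#include <cassert>
#include <cstddef>

// Assumption for portability: char16_t instead of wchar_t.
const char16_t han_array[] = u"\U0002008A";     // array: 2 surrogates + terminator
const char16_t* const han_ptr = han_array;      // pointer to the same storage

// Hypothetical helper: counts code units up to the terminator, like wcslen().
inline std::size_t unit_count(const char16_t* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != u'\0') ++n;
    return n;
}

// The array's size covers all three code units.
static_assert(sizeof(han_array) == 3 * sizeof(char16_t),
              "array size includes the terminator");
// The pointer's size is that of its type, whatever string it points to.
static_assert(sizeof(han_ptr) == sizeof(const char16_t*),
              "pointer size is independent of the string");
```

Calling unit_count(han_ptr) gives 2, matching what wcslen() reports for the wchar_t version on Windows.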

Naveen answered Sep 20 '22 23:09


sizeof(wchar_t*) is the same as sizeof(void*); in other words, it is the size of the pointer itself. That is 4 on a typical 32-bit system and 8 on a typical 64-bit system. To get the length of the string in wchar_t units, use wcslen() or lstrlenW() instead of sizeof():

#include <cwchar> // wcslen

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

size_t i1 = sizeof(wchar_t); // i1 == 2 on Windows
size_t i2 = wcslen(s1); // i2 == 1
size_t i3 = wcslen(s2); // i3 == 2 (two wchar_t units, one surrogate pair)
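
Note that wcslen() counts wchar_t units, not characters, which is why the single Han character gives 2. If the number of code points is what you need, a rough sketch could look like the following. count_code_points is a hypothetical helper, it uses char16_t rather than wchar_t for portability, and it assumes the input is well-formed UTF-16 (no unpaired surrogates).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts Unicode code points in a null-terminated,
// well-formed UTF-16 string.
inline std::size_t count_code_points(const char16_t* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != u'\0'; ++s) {
        // Each code point contributes exactly one unit that is NOT a low
        // surrogate (0xDC00..0xDFFF), so counting those units counts code points.
        if (*s < 0xDC00 || *s > 0xDFFF) ++n;
    }
    return n;
}
```

With this, count_code_points(u"\U0002008A") is 1 even though the string holds two code units.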
Remy Lebeau answered Sep 16 '22 23:09