
What is Microsoft using as the data type for Unicode Strings?

Tags:

c++

unicode

atl

wtl

I am in the process of learning C++ and came across an article on the MSDN here:

http://msdn.microsoft.com/en-us/magazine/dd861344.aspx

In the first code example the one line of code which my question relates to is the following:

VERIFY(SetWindowText(L"Direct2D Sample"));

More specifically, that L prefix. I had a little read up, and correct me if I am wrong :-), but this is to allow for Unicode strings, i.e. to prepare for a wide character set. During my read up on this I came across another article on Advanced String Techniques in C here: http://www.flipcode.com/archives/Advanced_String_Techniques_in_C-Part_I_Unicode.shtml

It says there are a few options, including adding the define:

#define UNICODE 

OR

#define _UNICODE

in C. Again, point out if I am wrong; I appreciate your feedback. Further, it shows the data type suitable for these Unicode strings being:

wchar_t

It throws into the mix a macro and a kind of hybrid datatype, the macro being:

_TEXT(t)

which simply prefixes the string with the L, and the hybrid data type being:

TCHAR 

The article points out that this will allow for Unicode if the define is present and ASCII if not. Now my question, or rather an assumption which I would like to confirm: would Microsoft use this TCHAR data type, which is more flexible, or is there any benefit to committing to wchar_t?

Also, when I ask whether Microsoft uses this, I mean more specifically in the ATL and WTL libraries, for example. Does anyone have a preference or some advice regarding this?

Cheers,

Andrew

REA_ANDREW asked Aug 27 '09 10:08





2 Answers

For all new software you should define UNICODE and use wchar_t directly. Using ANSI strings will come back to haunt you.

You should just use wchar_t and the wide versions of all the CRT functions (e.g. wcscmp instead of strcmp). The TEXT macros, TCHAR, etc. exist only for code that needs to work in both ANSI and Unicode environments, which I feel code rarely needs to do.

When you create a new Windows application using Visual Studio, UNICODE is automatically defined and wchar_t will work like a built-in.
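As a rough illustration of using the wide CRT functions directly, here is a minimal sketch. The helper names same_title and title_length are hypothetical and only wrap the standard wcscmp and wcslen calls; the string literal is the one from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>   // std::wcscmp, std::wcslen

// Hypothetical helper: compares two wide C strings with wcscmp,
// the wide counterpart of strcmp.
bool same_title(const wchar_t* a, const wchar_t* b) {
    return std::wcscmp(a, b) == 0;
}

// Hypothetical helper: returns the length in wchar_t units (not bytes),
// using wcslen, the wide counterpart of strlen.
std::size_t title_length(const wchar_t* s) {
    return std::wcslen(s);
}
```

For example, same_title(L"Direct2D Sample", L"Direct2D Sample") is true, and title_length counts wchar_t elements, so the byte size of the string is that count multiplied by sizeof(wchar_t).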

obelix answered Sep 20 '22 18:09



Short answer: the hybrid infrastructure with the TCHAR type, the _TEXT() macro and the various _t* functions (_tcscpy comes to mind) is a throwback to the times when Microsoft had two platforms coexisting:

  1. The Windows NT line was based on the Unicode string representation
  2. The Windows 95/98/ME line was based on ANSI string representation.

String representation here means that all the Windows APIs that expected strings from your app, or returned strings to it, used one or the other representation. COM added even more confusion, as it was available on both platforms and expected Unicode strings on both!

In those old times it was encouraged that you write "portable" code: you were instructed to use the hybrid infrastructure for your strings so that you can compile for both models just by defining/undefining UNICODE and/or _UNICODE for your app.
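The switching mechanism can be sketched roughly as follows. This is not the actual content of the Windows headers: MY_UNICODE, my_tchar, MY_TEXT and my_tcslen are hypothetical stand-ins for UNICODE/_UNICODE, TCHAR, _TEXT and _tcslen, written so that the example compiles without any Windows header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>  // std::strlen
#include <cwchar>   // std::wcslen

// Hypothetical stand-ins for the TCHAR machinery. Defining MY_UNICODE
// before this point selects the wide variants; leaving it undefined
// selects the narrow ones.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_TEXT_IMPL(s) L##s          // paste the L prefix onto the literal
#define MY_TEXT(s) MY_TEXT_IMPL(s)
#define my_tcslen std::wcslen
#else
typedef char my_tchar;
#define MY_TEXT(s) s                  // leave the literal narrow
#define my_tcslen std::strlen
#endif

// The same source line compiles to a wide or a narrow string,
// depending only on whether MY_UNICODE is defined.
const my_tchar* window_title() {
    return MY_TEXT("Direct2D Sample");
}

std::size_t window_title_length() {
    return my_tcslen(window_title());
}
```

In either configuration window_title_length() gives the same character count; only the element type of the string changes. That is the "portability" the hybrid types were meant to provide.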

As the Windows 9x line is no longer relevant (for the vast majority of apps, anyway), you can safely ignore the ANSI world and use Unicode strings directly.

Beware, though, that Unicode has multiple representations today. On Windows, wchar_t is a 16-bit type. It is often described as UCS-2 (every character encoded in a single 16-bit word), but modern Windows treats wide strings as UTF-16, in which characters outside the Basic Multilingual Plane take two 16-bit words. There are other widely used representations, such as UTF-8, where the "one 16-bit word per character" assumption does not hold at all.
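To make the caveat about 16-bit words concrete, here is a small sketch showing that the number of 16-bit units in a UTF-16 string can differ from the number of characters (code points). It uses char16_t and std::u16string rather than wchar_t so that it behaves the same on every platform; count_code_points is a hypothetical helper and assumes well-formed UTF-16 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A code point above U+FFFF is stored as a surrogate pair (two units),
// so every unit except a low (trailing) surrogate starts a new code point.
// Assumes the input is well-formed UTF-16.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        const bool is_low_surrogate = (unit >= 0xDC00 && unit <= 0xDFFF);
        if (!is_low_surrogate) {
            ++n;
        }
    }
    return n;
}
```

For the string u"a\U0001F600" the size() is 3 units, while count_code_points reports 2 characters, which is exactly the case a UCS-2 assumption gets wrong.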

LaszloG answered Sep 24 '22 18:09
