I've been reading some articles about Unicode and realized I'm still confused about what exactly to do about it.
As a C++ programmer on the Windows platform, the guidelines I was given were mostly the same from every teacher: always use the Unicode character set; templatize on the character type or use TCHAR where possible; prefer wchar_t and std::wstring over char and std::string.
#include <windows.h> // LPCTSTR, TEXT
#include <tchar.h>   // TCHAR
#include <string>
typedef std::basic_string<TCHAR> tstring;
// ...
static const char* const s_hello = "핼로"; // bad
static const wchar_t* const s_wchar_hello = L"핼로"; // better
static LPCTSTR s_tchar_hello = TEXT("핼로"); // even better
static const tstring s_tstring_hello( TEXT("핼로") ); // best
Somehow I got mixed up and led myself to believe that "something" means the text is ASCII and L"something" means it is Unicode. Then I read this:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in , called the underlying types.
So what? If my locale says to start from code page 949, does the range of wchar_t run from 949 to 949 + 2^(sizeof(wchar_t)*8)? And the wording sounds like 'I don't care whether your C++ implementation uses a UTF encoding or something else'.
At least I could understand that everything depends on the locale the application runs in. So I tested:
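#include <clocale>
#include <iostream>

// Prints one literal through both the narrow and the wide stream.
// x must be a string literal so that L##x pastes into a wide literal.
#define TEST_OSTREAM_PRINT(x) \
    std::cout << "----" << std::endl; \
    std::cout << "cout : " << x << std::endl; \
    std::wcout << "wcout : " << L##x << std::endl;

// Minimal version of the helper (its definition was omitted from the post):
// true when the first byte of an int holds its least significant byte.
static bool IsLittleEndian()
{
    const int one = 1;
    return *reinterpret_cast<const unsigned char*>(&one) == 1;
}

int main()
{
    std::cout << " * Info : " << std::endl
              << " sizeof(char) : " << sizeof(char) << std::endl
              << " sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl
              << " little endian? : " << IsLittleEndian() << std::endl;
    // Passing NULL queries the current locale without changing it.
    std::cout << " - LC_ALL: " << setlocale(LC_ALL, NULL) << std::endl;
    std::cout << " - LC_CTYPE: " << setlocale(LC_CTYPE, NULL) << std::endl;
    TEST_OSTREAM_PRINT("핼로");
    TEST_OSTREAM_PRINT("おはよう。");
    TEST_OSTREAM_PRINT("你好");
    TEST_OSTREAM_PRINT("resume");
    TEST_OSTREAM_PRINT("résumé");
    return 0;
}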
The output was:
Info
sizeof(char) = 1
sizeof(wchar_t) = 2
LC_ALL = C
LC_CTYPE = C
----
cout : 핼로
wcout : ----
cout : おはよう。
wcout : ----
cout : ?好
wcout : ----
cout : resume
wcout : resume
----
cout : r?sum?
wcout : r?um
Another output with Korean locale:
Info
sizeof(char) = 1
sizeof(wchar_t) = 2
LC_ALL = Korean_Korea.949
LC_CTYPE = Korean_Korea.949
----
cout : 핼로
wcout : 핼로
----
cout : おはよう。
wcout : おはよう。
----
cout : ?好
wcout : ----
cout : resume
wcout : resume
----
cout : r?sum?
wcout : resume
Another output, with the French locale:
Info
sizeof(char) = 1
sizeof(wchar_t) = 2
LC_ALL = fr-FR
LC_CTYPE = fr-FR
----
cout : CU·I
wcout : ----
cout : ªªªIªeª|¡£
wcout : ----
cout : ?u¿
wcout : ----
cout : resume
wcout : resume
----
cout : r?sum?
wcout : resume
It turns out that if I don't set the right locale, the application fails to handle certain ranges of characters, no matter whether I use char or wchar_t. That's not the only problem. Visual Studio gives a warning:
warning C4566: character represented by universal-character-name '\u4F60' cannot be represented in the current code page (949)
I'm not sure if this is describing what I'm getting as output or something else.
Question: What are the best practices, and why? How can one make an application platform-, implementation-, and nation-independent? What exactly happens to string literals in the source? How are string values interpreted by the application?
C++ doesn't have proper built-in Unicode support. You can't really write a properly globalized application in C++ without third-party libraries. Read this insightful SO answer. If you really need to write an application that uses Unicode, I'd look at the ICU library.
On Windows, Microsoft guarantees that wchar_t supports Unicode, so L"핼로" is the correct way to produce a UTF-16 string literal as a const wchar_t*. On other platforms, this doesn't necessarily hold, and you should use the C++11 Unicode string literals (u8"...", u"...", and U"...") if you need your code to be portable; e.g., use u8"핼로" to produce a UTF-8 encoded const char* (as of Visual Studio 2015).
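As a rough illustration of the three literal forms, the sketch below counts code units for the same two-syllable word. It spells the characters as universal-character-names (\uD57C for 핼, \uB85C for 로) so the result does not depend on how the source file is saved. The function names are made up for this example. It works on array sizes rather than pointer types so that it also compiles under C++20, where u8 literals have element type char8_t instead of char.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// Hypothetical helpers: the word "핼로" in each Unicode encoding form.
constexpr std::size_t hello_utf8_units()  { return code_units(u8"\uD57C\uB85C"); }
constexpr std::size_t hello_utf16_units() { return code_units(u"\uD57C\uB85C"); }
constexpr std::size_t hello_utf32_units() { return code_units(U"\uD57C\uB85C"); }

// Checked at compile time, so a wrong assumption fails the build.
static_assert(hello_utf8_units() == 6, "UTF-8 needs three bytes for each of these syllables");
static_assert(hello_utf16_units() == 2, "both syllables fit in one UTF-16 code unit");
static_assert(hello_utf32_units() == 2, "UTF-32 always uses one code unit per code point");
```

The point of the comparison is that the encoding is fixed by the literal prefix, not by the locale or by the size of wchar_t on the platform.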
The other problem you are encountering is with how Visual Studio interprets the encoding of your source file. For example, お is encoded as 0xAA 0xAA in EUC-KR (code page 949), which is the encoding for ªª in code page 1252 (fr-FR); that is, if you saved your source file containing お in EUC-KR but compile it in an fr-FR locale, your literal will encode ªª.
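The byte-level reinterpretation can be sketched without any Windows API. The two functions below are hypothetical, deliberately tiny decoders, not real code page tables. One covers only the hiragana row of KS X 1001 (lead byte 0xAA in EUC-KR), assuming that row lists hiragana in Unicode order starting at U+3041. The other covers only bytes 0xA0..0xFF of code page 1252, where each byte maps to the code point of the same value.

```cpp
#include <cstdint>

// Hypothetical mini-decoder: an EUC-KR byte pair whose lead byte is 0xAA
// (the hiragana row). Returns 0 for anything this sketch does not cover.
constexpr char32_t decode_euckr_hiragana(std::uint8_t lead, std::uint8_t trail)
{
    return (lead == 0xAA && trail >= 0xA1 && trail <= 0xF3)
        ? static_cast<char32_t>(0x3041 + (trail - 0xA1))
        : static_cast<char32_t>(0);
}

// Hypothetical mini-decoder: a single code page 1252 byte in 0xA0..0xFF.
// Returns 0 for lower bytes, which this sketch does not cover.
constexpr char32_t decode_cp1252_high(std::uint8_t byte)
{
    return byte >= 0xA0 ? static_cast<char32_t>(byte) : static_cast<char32_t>(0);
}

// The same two bytes, read under the two encodings.
static_assert(decode_euckr_hiragana(0xAA, 0xAA) == 0x304A,
              "EUC-KR reading: one character, U+304A (hiragana O)");
static_assert(decode_cp1252_high(0xAA) == 0x00AA,
              "code page 1252 reading: U+00AA, once for each byte");
```

Nothing in the bytes themselves says which reading is intended, which is why the compiler has to be told (or has to guess from the locale) how the source file is encoded.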
If you need to include non-ASCII characters in your source, you should save the file in a Unicode encoding (UTF-8, UTF-16, or UTF-32) with an explicit BOM, as described in the answer to this question.
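For reference, a BOM is just a fixed byte prefix, which is what lets a tool recognize the encoding instead of guessing from the locale. A minimal sketch of such a check follows. The enum and function names are invented for illustration, and this is not a description of how any particular compiler implements it.

```cpp
#include <cstddef>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Classifies the first n bytes of a file by their byte order mark.
// UTF-32 LE is tested before UTF-16 LE because FF FE 00 00 also starts with FF FE.
constexpr Bom detect_bom(const unsigned char* p, std::size_t n)
{
    return (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) ? Bom::Utf32LE
         : (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) ? Bom::Utf32BE
         : (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) ? Bom::Utf8
         : (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) ? Bom::Utf16LE
         : (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) ? Bom::Utf16BE
         : Bom::None;
}
```

Given the first few bytes of a source file, detect_bom(bytes, count) returns Bom::None for a file saved without a BOM, which is the case where the interpretation falls back to the current code page.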