
c++ string literal still confusing

Tags: c++, unicode

I've been reading some articles about Unicode and realized I'm still confused about what exactly to do about it.

As a C++ programmer on the Windows platform, the guidance I've been given has been pretty much the same from every teacher: always use the Unicode character set; templatize on the character type or use TCHAR where possible; prefer wchar_t and std::wstring over char and std::string.

#include <windows.h> // LPCTSTR, TEXT
#include <tchar.h>   // TCHAR
#include <string>
typedef std::basic_string<TCHAR> tstring;
// ...
static const char* const s_hello = "핼로";             // bad
static const wchar_t* const s_wchar_hello = L"핼로";   // better
static LPCTSTR s_tchar_hello = TEXT("핼로");           // even better
static const tstring s_tstring_hello( TEXT("핼로") );  // best

Somehow I got mixed up and led myself to believe that if I write "something", it is ASCII-encoded, and if I write L"something", it is Unicode. Then I read this:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.

So what? If my locale says to start from code page 949, does the range of wchar_t extend from 949 to 949 + 2^(sizeof(wchar_t)*8)? And the way it's worded sounds like 'I don't care whether your C++ implementation uses a UTF encoding or not'.
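If I'm reading that right, all it guarantees is that wchar_t is some integer type wide enough for the implementation's largest character set; the encoding itself is left to the implementation. A quick sanity check of what that means for a character outside the BMP (the counts in the comments assume a typical MSVC-on-Windows vs. GCC-on-Linux toolchain):

#include <iostream>

int main()
{
    // U+1F600 is outside the BMP, so it cannot fit in a single 16-bit unit.
    const wchar_t  wide[] = L"\U0001F600";
    const char32_t full[] = U"\U0001F600";

    // MSVC:      sizeof(wchar_t) == 2, the literal holds 2 UTF-16 code units (a surrogate pair).
    // GCC/Linux: sizeof(wchar_t) == 4, the literal holds 1 UTF-32 code unit.
    std::cout << "sizeof(wchar_t)     : " << sizeof(wchar_t) << std::endl;
    std::cout << "wchar_t code units  : " << sizeof(wide) / sizeof(wchar_t) - 1 << std::endl;
    std::cout << "char32_t code units : " << sizeof(full) / sizeof(char32_t) - 1 << std::endl;
    return 0;
}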

At least I could understand that everything depends on which locale the application runs under. So I tested:

#include <clocale>   // std::setlocale
#include <iostream>

// Helper used below (not shown in the original post): true on little-endian machines.
static bool IsLittleEndian()
{
    const unsigned int one = 1;
    return *reinterpret_cast<const unsigned char*>(&one) == 1;
}

#define TEST_OSTREAM_PRINT(x) \
    std::cout << "----" << std::endl; \
    std::cout << "cout : " << x << std::endl; \
    std::wcout << "wcout : " << L##x << std::endl;

int main()
{
    std::cout << " * Info : " << std::endl
              << "     sizeof(char) : " << sizeof(char) << std::endl
              << "     sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl
              << "     little endian? : " << IsLittleEndian() << std::endl;
    std::cout << " - LC_ALL: " << std::setlocale(LC_ALL, NULL) << std::endl;
    std::cout << " - LC_CTYPE: " << std::setlocale(LC_CTYPE, NULL) << std::endl;

    TEST_OSTREAM_PRINT("핼로");
    TEST_OSTREAM_PRINT("おはよう。");
    TEST_OSTREAM_PRINT("你好");
    TEST_OSTREAM_PRINT("resume");
    TEST_OSTREAM_PRINT("résumé");

    return 0;
}

Then the output was:

Info
 sizeof(char) = 1
 sizeof(wchar_t) = 2
 LC_ALL = C
 LC_CTYPE = C
----
cout : 핼로
wcout : ----
cout : おはよう。
wcout : ----
cout : ?好
wcout : ----
cout : resume
wcout : resume
----
cout : r?sum?
wcout : r?um

Another output with Korean locale:

Info
 sizeof(char) = 1
 sizeof(wchar_t) = 2
 LC_ALL = Korean_Korea.949
 LC_CTYPE = Korean_Korea.949
----
cout : 핼로
wcout : 핼로
----
cout : おはよう。
wcout : おはよう。
----
cout : ?好
wcout : ----
cout : resume
wcout : resume
----
cout : r?sum?
wcout : resume

Another output, with a French (fr-FR) locale:

Info
 sizeof(char) = 1
 sizeof(wchar_t) = 2
 LC_ALL = fr-FR
 LC_CTYPE = fr-FR
----
cout : CU·I
wcout : ----
cout : ªªªIªeª|¡£
wcout : ----
cout : ?u¿
wcout : ----
cout : resume
wcout : resume
----
cout : r?sum?
wcout : resume

It turns out that if I don't set the right locale, the application fails to handle a certain range of characters, whether I use char or wchar_t. And that's not the only problem: Visual Studio also gives a warning:

warning C4566: character represented by universal-character-name '\u4F60' cannot be represented in the current code page (949)

I'm not sure if this is describing what I'm getting as output or something else.
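For reference, one thing I noticed is that the runs above differ only in which locale the CRT happens to be using: by default it starts in the minimal "C" locale and takes nothing from the user's environment until the program asks for it. A side experiment (not part of the outputs above):

#include <clocale>
#include <iostream>

int main()
{
    // The CRT starts in the minimal "C" locale.
    std::cout << "before: " << std::setlocale(LC_ALL, NULL) << std::endl;

    // An empty string means "use the user's default locale", e.g.
    // Korean_Korea.949 on a Korean Windows installation.
    std::setlocale(LC_ALL, "");
    std::cout << "after : " << std::setlocale(LC_ALL, NULL) << std::endl;
    return 0;
}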

Question: What are the best practices here, and why? How can one make an application platform/implementation/nation independent? What exactly happens to string literals in the source? How are string values interpreted by the application?

asked May 07 '15 by user2883715


2 Answers

C++ doesn't have proper Unicode support. You simply can't write a properly globalized application in C++ without using third-party libraries. Read this insightful SO answer. If you really need to write an application which uses Unicode, I'd look at the ICU library.
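For illustration, here is a minimal sketch of what using ICU looks like, assuming the ICU common library (icuuc) is installed and the code is built as C++11/14/17:

#include <iostream>
#include <string>
#include <unicode/unistr.h>   // icu::UnicodeString

int main()
{
    // Build a UTF-16 string from a UTF-8 literal, independent of the
    // process locale or the console code page.
    icu::UnicodeString hello = icu::UnicodeString::fromUTF8(u8"핼로, résumé");

    std::cout << "UTF-16 code units : " << hello.length()      << std::endl;
    std::cout << "code points       : " << hello.countChar32() << std::endl;

    // Convert back to UTF-8 for output or storage.
    std::string utf8;
    hello.toUTF8String(utf8);
    std::cout << utf8 << std::endl;
    return 0;
}

On Linux this builds with something like g++ demo.cpp $(pkg-config --cflags --libs icu-uc); the point is that all text handling goes through a type with a defined encoding instead of depending on the process locale.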

answered Nov 11 '22 by ixSci


On Windows, Microsoft guarantees that wchar_t supports Unicode, so L"핼로" is the correct way to produce a UTF-16 string literal as a const wchar_t*. On other platforms, this doesn't necessarily hold, and you should use the C++11 Unicode string literals (u8"...", u"...", and U"...") if you need your code to be portable—e.g., use u8"핼로" to produce a UTF-8 encoded const char* (as of Visual Studio 2015).
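To make that concrete, here is a small sketch of the C++11 literal forms and what they hold (the counts assume the compiler decoded the source correctly, and the code is built as C++11/14/17, since C++20 changes the type of u8 literals):

#include <cstring>
#include <iostream>
#include <string>

int main()
{
    const char*     u8s = u8"핼로";  // UTF-8  : 2 code points x 3 bytes each
    const char16_t* u16 = u"핼로";   // UTF-16 : 2 code units
    const char32_t* u32 = U"핼로";   // UTF-32 : 2 code units
    const wchar_t*  ws  = L"핼로";   // UTF-16 on Windows, UTF-32 on most Unix systems

    std::cout << "u8  bytes      : " << std::strlen(u8s) << std::endl;                        // 6
    std::cout << "u16 code units : " << std::char_traits<char16_t>::length(u16) << std::endl; // 2
    std::cout << "u32 code units : " << std::char_traits<char32_t>::length(u32) << std::endl; // 2
    return 0;
}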

The other problem you are encountering is how Visual Studio interprets the encoding of your source file. For example, お is encoded as 0xAA 0xAA in EUC-KR (code page 949), which is the encoding of ªª in code page 1252 (fr-FR). That is, if you save your source file containing お as EUC-KR but compile it under an fr-FR locale, your literal will actually encode ªª.

If you need to include non-ASCII characters in your source, you should save the file in a UTF encoding (i.e., UTF-8/16/32) with an explicit BOM, as described in the answer to this question.
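If it's unclear what the compiler actually stored, one way to check is to dump the raw bytes of a literal and compare them with the sequence you expect; a small diagnostic sketch (built as C++11/14/17):

#include <cstdio>

int main()
{
    static const char narrow[] = "핼로";    // whatever the compiler decided the source bytes meant
    static const char utf8[]   = u8"핼로";  // guaranteed UTF-8

    std::printf("narrow :");
    for (unsigned char c : narrow) std::printf(" %02X", c);
    std::printf("\nu8     :");
    for (unsigned char c : utf8)   std::printf(" %02X", c);
    std::printf("\n");
    return 0;
}

If the compiler decoded the source correctly, the u8 line should read ED 95 BC EB A1 9C 00; the narrow line shows the same characters re-encoded into the current code page, or mojibake if the source encoding and the compiler's assumption disagree.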

answered Nov 11 '22 by 一二三