How can I embed unicode string constants in a source file?

I'm writing some unit tests to verify our handling of various resources that use character sets other than the normal Latin alphabet: Cyrillic, Hebrew, etc.

The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...

///
/// Protected: TestGetHebrewConfigString
///  
void CPrIniFileReaderTest::TestGetHebrewConfigString()
{
    prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
    CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
    prIniListReader.SetCurrentSection( strHebrewSubSection );   

    CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
}

This quite simply doesn't work. Previously I worked around this using a macro that calls a routine to transform a narrow string to a wide string (we use towstring all over the place in our applications, so it's existing code):

#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )

wstring towstring( LPCSTR lpszValue )
{
    wostringstream os;
    os << lpszValue;
    return os.str();
}

The assertion in the test above then became:

CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );

This worked OK on OS X, but now that I'm porting to Linux the tests are all failing. It all feels rather hackish as well. Can anyone tell me if they have a nicer solution to this problem?

jkp asked Jan 14 '09 12:01


2 Answers

A tedious but portable way is to build your strings using numeric escape codes. For example:

const wchar_t *string = L"דונדארןמע";

becomes:

const wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";

You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.

You can use online tools for conversion, such as this one. It outputs the JavaScript escape format \uXXXX, so just search & replace \u with \x to get the C format.
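As a rough sketch of the escaped form, the first expectation from the question could be written as below. The function names hebrewKey and hebrewKeyHex are hypothetical, chosen only for illustration. Note that C++ also accepts \uXXXX universal character names directly in string literals, so replacing \u with \x is optional. One thing to watch with \x is that it consumes every hex digit that follows, so an escape directly followed by a character such as 'a' or '5' needs the literal to be split in two.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the key from the question, written with universal
// character names (\uXXXX) so the source file contains only ASCII.
inline std::wstring hebrewKey()
{
    return L"\u05d3\u05d5\u05e0\u05d3\u05d0\u05e8\u05df\u05de\u05e2";
}

// The same text written with \x escapes. Each escape here is terminated by
// the backslash of the next one. If ordinary text followed, the literal
// would have to be split, e.g. L"\x05d3" L"abc", because \x keeps
// consuming hex digits for as long as it finds them.
inline std::wstring hebrewKeyHex()
{
    return L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
}
```

Both functions should yield the same nine wide characters, so either form can be passed to GetConfigString in the test without depending on how the source file is encoded.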

fbonnet answered Oct 04 '22 12:10



You have to tell GCC which encoding your source file uses for those characters.

Use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it which encoding to use for wide string literals at runtime. That determines the values of the wchar_t items in the strings. You set that encoding using -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the code unit size of the encoding (UTF-32 needs 32 bits, UTF-16 needs 16 bits) must not exceed the size of the wchar_t that GCC uses.

You can adjust the size of wchar_t with the option -fshort-wchar, which will most likely make it 16 bits instead of 32 bits, its usual width for GCC on Linux. That option is mainly useful for compiling programs for Wine, which are designed to be compatible with Windows.

Those options are described in more detail in man gcc, the GCC manpage.
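To check that the chosen settings took effect, a small sanity test along these lines may help. It is only a sketch: the names wcharBits and rawLiteralDecoded are hypothetical, and it assumes the source file is saved as UTF-8 and compiled with a matching input charset.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Assumed build setup: the file is saved as UTF-8 and compiled with
// -finput-charset=UTF-8 (plus -fwide-exec-charset as described above).

// Width of wchar_t in bits: usually 32 for GCC on Linux, and most likely
// 16 when -fshort-wchar is used.
inline std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when the compiler decoded the raw Hebrew letter dalet into the
// single code point U+05D3 followed by the terminating null. A mismatch
// suggests the input charset does not match the file's real encoding.
inline bool rawLiteralDecoded()
{
    const wchar_t raw[] = L"ד";
    return sizeof(raw) / sizeof(raw[0]) == 2 && raw[0] == 0x05d3;
}
```

Calling rawLiteralDecoded() from an early test case gives a clearer failure than a batch of mismatched string comparisons, and wcharBits() shows whether a 32-bit encoding such as UTF-32 can fit in wchar_t on the current platform.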

Johannes Schaub - litb answered Oct 04 '22 11:10
