I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the usual Latin alphabet: Cyrillic, Hebrew, etc.
The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...
///
/// Protected: TestGetHebrewConfigString
///
void CPrIniFileReaderTest::TestGetHebrewConfigString()
{
    prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
    CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
    prIniListReader.SetCurrentSection( strHebrewSubSection );
    CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
}
This quite simply doesn't work. Previously I worked around this using a macro which calls a routine to transform a narrow string into a wide string (we use towstring all over the place in our applications, so it's existing code):
#include <sstream>   // wostringstream
#include <string>    // wstring ("using namespace std" assumed, as elsewhere in the codebase)
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
// Widens a narrow string by streaming it into a wide stream: each byte is
// widened individually via the stream's locale, so multi-byte encodings
// such as UTF-8 are not decoded.
wstring towstring( LPCSTR lpszValue )   // LPCSTR is const char*
{
    wostringstream os;
    os << lpszValue;
    return os.str();
}
The assertion in the test above then became:
CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );
This worked OK on OS X, but now that I'm porting to Linux I'm finding that the tests all fail, and the whole thing feels rather hackish anyway. Can anyone tell me if they have a nicer solution to this problem?
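For what it's worth, the reason that macro misbehaves is that os << lpszValue widens the narrow string one byte at a time through the stream's locale, so when the source file is saved as UTF-8 each Hebrew letter (two bytes in UTF-8) comes out as two mangled wchar_t values instead of one code point. Below is a minimal sketch of a replacement that actually decodes UTF-8; it assumes a C++11 toolchain and that the test sources really are UTF-8, the name towstring_utf8 is mine, and std::codecvt_utf8 is deprecated since C++17 though still shipped by the major standard libraries:

#include <codecvt>
#include <locale>
#include <string>

// Decode a UTF-8 encoded narrow string into one wchar_t per code point,
// rather than widening it byte by byte.
std::wstring towstring_utf8( const char* lpszValue )
{
    std::wstring_convert< std::codecvt_utf8< wchar_t > > converter;
    return converter.from_bytes( lpszValue );
}

UNICODE_CONSTANT could forward to something like this, provided the compiler really does store the narrow literals as UTF-8 bytes in the binary, which is what the GCC charset options discussed further down control.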
One alternative is ICU's UnicodeString, a string class that stores Unicode characters directly and provides functionality similar to the Java String and StringBuffer/StringBuilder classes.
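For example (a minimal sketch, assuming ICU is available; fromUTF8 has been part of the API since ICU 4.2):

#include <unicode/unistr.h>   // icu::UnicodeString

int main()
{
    // Build a UnicodeString from the UTF-8 bytes of a narrow literal.
    icu::UnicodeString strKey = icu::UnicodeString::fromUTF8( "דונדארןמע" );
    return strKey.isEmpty() ? 1 : 0;
}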
A Unicode string is a sequence of code points: numbers from 0 through 0x10FFFF (1,114,111 decimal) that identify characters from almost all of the world's languages. That sequence of code points has to be represented in memory as code units, and the code units are in turn mapped to 8-bit bytes. C++ provides a wide-character type, wchar_t, which can hold such strings; its width is implementation-defined, but wide literals are often stored as UTF-32, with one 32-bit wchar_t per code point.
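As a quick illustration of the code point versus code unit distinction (a minimal sketch; it assumes the compiler knows the source file is UTF-8, which is what the GCC options further down are about):

#include <cstdio>

int main()
{
    // One Hebrew letter is a single code point (U+05D3)...
    const wchar_t wide[]   = L"ד";
    // ...but two 8-bit code units when the same file is stored as UTF-8.
    const char    narrow[] = "ד";
    std::printf( "%u wide, %u narrow code units\n",
                 (unsigned)( sizeof wide / sizeof wide[0] - 1 ),
                 (unsigned)( sizeof narrow - 1 ) );   // "1 wide, 2 narrow code units"
    return 0;
}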
A tedious but portable way is to build your strings using numeric escape codes. For example:
const wchar_t *string = L"דונדארןמע";
becomes:
const wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.
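Applied to the assertion from the question, that would look something like the following. I transcribed the second Hebrew constant to escapes by hand, so double-check the values against your real data; it also assumes GetConfigString returns a std::wstring, as the original assertion implies:

// Both constants spelled out as code points: the source now contains only ASCII.
const wchar_t* strKey      = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
const wchar_t* strExpected = L"\x05d3\x05d5\x05e0\x05d4\x05e9\x05da\x05d5\x05e7";
CPPUNIT_ASSERT( prIniListReader.GetConfigString( strKey ) == strExpected );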
You can use online tools to do the conversion. Many of them output the JavaScript escape format \uXXXX, so just search and replace \u with \x to get the C format.
You have to tell GCC which encoding your source file uses for those characters: use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it which encoding to use for wide string literals at runtime, which determines the values of the wchar_t items in the strings; you set that encoding with -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the size of the encoding's code unit (32 bits for UTF-32, 16 bits for UTF-16) must not exceed the size of the wchar_t GCC uses.
You can adjust that size: the option -fshort-wchar makes wchar_t 16 bits wide instead of the 32 bits that is its usual width for GCC on Linux. That option is mainly useful for compiling programs for Wine, which are designed to be compatible with Windows. All of these options are described in more detail in man gcc, the GCC manpage.
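Putting the two options together, a minimal sanity check could look like this; the file name is arbitrary and the flags simply mirror the examples above, assuming g++ on Linux with UTF-8 sources:

// hebrew_literal.cpp - build with, for example:
//   g++ -finput-charset=UTF-8 -fwide-exec-charset=UTF-32 hebrew_literal.cpp -o hebrew_literal
// -finput-charset says how the bytes of this file are encoded;
// -fwide-exec-charset controls what ends up in the wchar_t array at runtime.
#include <cstdio>

int main()
{
    const wchar_t hebrew[] = L"דונדארןמע";   // the constant from the question
    // Prints the number of wchar_t elements in the literal: it should be 9,
    // one per Hebrew letter, if the input and wide execution charsets are right.
    std::printf( "%u\n", (unsigned)( sizeof hebrew / sizeof hebrew[0] - 1 ) );
    return 0;
}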