I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the usual Latin alphabet: Cyrillic, Hebrew, etc.
The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...
///
/// Protected: TestGetHebrewConfigString
///
void CPrIniFileReaderTest::TestGetHebrewConfigString()
{
    prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
    CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
    prIniListReader.SetCurrentSection( strHebrewSubSection );
    CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
}
This quite simply doesn't work. Previously I worked around this using a macro which calls a routine to transform a narrow string into a wide string (we use towstring all over the place in our applications, so it's existing code):
#include <sstream>   // wostringstream
#include <string>    // wstring ("using namespace std" assumed, as elsewhere in the codebase)
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
// Widens a narrow string by streaming it into a wide stream: each byte is
// widened individually via the stream's locale, so multi-byte encodings
// such as UTF-8 are not decoded.
wstring towstring( LPCSTR lpszValue )   // LPCSTR is const char*
{
    wostringstream os;
    os << lpszValue;
    return os.str();
}
The assertion in the test above then became:
CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );
This worked OK on OS X, but now that I'm porting to Linux I'm finding that the tests all fail, and the whole thing feels rather hackish anyway. Can anyone tell me if they have a nicer solution to this problem?
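For what it's worth, the reason that macro misbehaves is that os << lpszValue widens the narrow string one byte at a time through the stream's locale, so when the source file is saved as UTF-8 each Hebrew letter (two bytes in UTF-8) comes out as two mangled wchar_t values instead of one code point. Below is a minimal sketch of a replacement that actually decodes UTF-8; it assumes a C++11 toolchain and that the test sources really are UTF-8, the name towstring_utf8 is mine, and std::codecvt_utf8 is deprecated since C++17 though still shipped by the major standard libraries:

#include <codecvt>
#include <locale>
#include <string>

// Decode a UTF-8 encoded narrow string into one wchar_t per code point,
// rather than widening it byte by byte.
std::wstring towstring_utf8( const char* lpszValue )
{
    std::wstring_convert< std::codecvt_utf8< wchar_t > > converter;
    return converter.from_bytes( lpszValue );
}

UNICODE_CONSTANT could forward to something like this, provided the compiler really does store the narrow literals as UTF-8 bytes in the binary, which is what the GCC charset options discussed further down control.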
One alternative is ICU's UnicodeString, a string class that stores Unicode characters directly and provides functionality similar to the Java String and StringBuffer/StringBuilder classes.
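For example (a minimal sketch, assuming ICU is available; fromUTF8 has been part of the API since ICU 4.2):

#include <unicode/unistr.h>   // icu::UnicodeString

int main()
{
    // Build a UnicodeString from the UTF-8 bytes of a narrow literal.
    icu::UnicodeString strKey = icu::UnicodeString::fromUTF8( "דונדארןמע" );
    return strKey.isEmpty() ? 1 : 0;
}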
A Unicode string is a sequence of code points: numbers from 0 through 0x10FFFF (1,114,111 decimal) that identify characters from almost all of the world's languages. That sequence of code points has to be represented in memory as code units, and the code units are in turn mapped to 8-bit bytes. C++ provides a wide-character type, wchar_t, which can hold such strings; its width is implementation-defined, but wide literals are often stored as UTF-32, with one 32-bit wchar_t per code point.
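As a quick illustration of the code point versus code unit distinction (a minimal sketch; it assumes the compiler knows the source file is UTF-8, which is what the GCC options further down are about):

#include <cstdio>

int main()
{
    // One Hebrew letter is a single code point (U+05D3)...
    const wchar_t wide[]   = L"ד";
    // ...but two 8-bit code units when the same file is stored as UTF-8.
    const char    narrow[] = "ד";
    std::printf( "%u wide, %u narrow code units\n",
                 (unsigned)( sizeof wide / sizeof wide[0] - 1 ),
                 (unsigned)( sizeof narrow - 1 ) );   // "1 wide, 2 narrow code units"
    return 0;
}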
A tedious but portable way is to build your strings using numeric escape codes. For example:
const wchar_t *string = L"דונדארןמע";
becomes:
const wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.
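Applied to the assertion from the question, that would look something like the following. I transcribed the second Hebrew constant to escapes by hand, so double-check the values against your real data; it also assumes GetConfigString returns a std::wstring, as the original assertion implies:

// Both constants spelled out as code points: the source now contains only ASCII.
const wchar_t* strKey      = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
const wchar_t* strExpected = L"\x05d3\x05d5\x05e0\x05d4\x05e9\x05da\x05d5\x05e7";
CPPUNIT_ASSERT( prIniListReader.GetConfigString( strKey ) == strExpected );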
You can use online tools to do the conversion. Many of them output the JavaScript escape format \uXXXX, so just search and replace \u with \x to get the C format.
You have to tell GCC which encoding your source file uses for those characters: use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it which encoding to use for wide string literals at runtime, which determines the values of the wchar_t items in the strings; you set that encoding with -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the size of the encoding's code unit (32 bits for UTF-32, 16 bits for UTF-16) must not exceed the size of the wchar_t GCC uses.
You can adjust that size: the option -fshort-wchar makes wchar_t 16 bits wide instead of the 32 bits that is its usual width for GCC on Linux. That option is mainly useful for compiling programs for Wine, which are designed to be compatible with Windows. All of these options are described in more detail in man gcc, the GCC manpage.
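Putting the two options together, a minimal sanity check could look like this; the file name is arbitrary and the flags simply mirror the examples above, assuming g++ on Linux with UTF-8 sources:

// hebrew_literal.cpp - build with, for example:
//   g++ -finput-charset=UTF-8 -fwide-exec-charset=UTF-32 hebrew_literal.cpp -o hebrew_literal
// -finput-charset says how the bytes of this file are encoded;
// -fwide-exec-charset controls what ends up in the wchar_t array at runtime.
#include <cstdio>

int main()
{
    const wchar_t hebrew[] = L"דונדארןמע";   // the constant from the question
    // Prints the number of wchar_t elements in the literal: it should be 9,
    // one per Hebrew letter, if the input and wide execution charsets are right.
    std::printf( "%u\n", (unsigned)( sizeof hebrew / sizeof hebrew[0] - 1 ) );
    return 0;
}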