C++ UTF-8 output with ICU

I'm struggling to get started with the C++ ICU library. I have tried to get the simplest example to work, but even that has failed. I would just like to output a UTF-8 string and then go from there.

Here is what I have:

#include <unicode/unistr.h>
#include <unicode/ustream.h>

#include <iostream>

int main()
{
    UnicodeString s = UNICODE_STRING_SIMPLE("привет");

    std::cout << s << std::endl;

    return 0;
}

Here is the output:

$ g++ -I/sw/include -licucore -Wall -Werror -o icu_test main.cpp 
$ ./icu_test 
пÑивеÑ

My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8.

I think that perhaps I somehow need to set the output stream to UTF-8 because ICU stores strings as UTF-16, but I'm really not sure and I would have thought that the operators provided by ustream.h would do that anyway.

Any help would be appreciated, thank you.

Isaac asked Apr 29 '10 17:04


2 Answers

Your program will work if you just change the initializer to:

UnicodeString s("привет");

The macro you were using, UNICODE_STRING_SIMPLE, is only for strings that contain "invariant characters", i.e., only Latin letters, digits, and some punctuation.

As was said before, input/output codepages are tricky. You said:

"My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8."

That may be true, but ICU doesn't know it. The process codepage might be something else (say, ISO-8859-1), and the output codepage might be different again (say, Shift-JIS), in which case the program wouldn't work. Strings made only of invariant characters and created with UNICODE_STRING_SIMPLE would still work, though.

Hope this helps.

srl, icu dev

Steven R. Loomis answered Nov 17 '22 03:11


What happens if you write the output to a file (either by redirecting it from the terminal, or by opening a file stream in the program itself)?

That would determine whether or not it is the terminal that fails to handle the output correctly.

What happens if you inspect the output string in the debugger? Does it contain the correct values? Find out what the UTF-8 encoding of your string should look like, and compare it against what you get in the debugger. Or print out the integral value of each byte, and verify that those are correct.

When working with encodings, it is always tricky (but essential) to determine whether the problem lies in your program itself or in the conversion that happens when the text is output to the system. Take the terminal out of the equation and verify that your program generates the correct output.

jalf answered Nov 17 '22 02:11