ICU iterate codepoints

Tags: c++, icu

My objective is to iterate over Unicode text one character at a time, but the code below iterates over code units instead of code points, even though I am using next32PostInc(), which is supposed to iterate over code points:

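#include <iostream>
#include <string>
#include <unicode/uchriter.h>  // UCharCharacterIterator
#include <unicode/unistr.h>    // UnicodeString
#include <unicode/ustring.h>   // u_strlen

using namespace icu;

void iterate_codepoints(UCharCharacterIterator &it, std::string &str) {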
    UChar32 c;
    while (it.hasNext()) {
        c = it.next32PostInc();
        str += c;
    }
}

void my_test() {
    const char testChars[] = "\xE6\x96\xAF"; // Chinese character 斯 in UTF-8
    UnicodeString testString(testChars, "");
    const UChar *testText = testString.getTerminatedBuffer();

    UCharCharacterIterator iter(testText, u_strlen(testText));

    std::string str;
    iterate_codepoints(iter, str);
    std::cout << str; // outputs 斯 in UTF-8 format
}


int main() {
    my_test();
    return 0;
}

The code above produces the correct output, the Chinese character 斯, but the loop runs 3 iterations for this single character instead of just 1. Can someone explain what I am doing wrong?

In a nutshell, I just want to traverse the characters in a loop, and I am happy to use whichever ICU iteration classes are necessary.

Still trying to resolve this...

I also observed some bad behavior using UnicodeString as seen below. I am using VC++ 2013.

void test_02() {
    //  UnicodeString us = "abc 123 ñ";     // results in good UTF-8: 61 62 63 20 31 32 33 20 c3 b1  
    //  UnicodeString us = "斯";             // results in bad  UTF-8: 3f
    //  UnicodeString us = "abc 123 ñ 斯";  // results in bad  UTF-8: 61 62 63 20 31 32 33 20 c3 b1 20 3f  (only the last part '3f' is corrupt)
    //  UnicodeString us = "\xE6\x96\xAF";  // results in bad  UTF-8: 00 55 24 04 c4 00 24  
    //  UnicodeString us = "\x61";          // results in good UTF-8: 61
    //  UnicodeString us = "\x61\x62\x63";  // results in good UTF-8: 61 62 63
    //  UnicodeString us = "\xC3\xB1";      // results in bad  UTF-8: c3 83 c2 b1  
    UnicodeString us = "ñ";                 // results in good UTF-8: c3 b1    
    std::string cs;
    us.toUTF8String(cs);
    std::cout << cs; // output result to file, i.e.: main >output.txt

}


Caroline Beltran asked Oct 19 '14 02:10


1 Answer

Since your source data is UTF-8, you need to tell that to UnicodeString. Its constructor has a codepage parameter for that purpose, but you are setting it to a blank string:

UnicodeString testString(testChars, "");

That tells UnicodeString to perform an invariant conversion, which is not what you want. You end up with 3 codepoints (U+00E6 U+0096 U+00AF) instead of 1 codepoint (U+65AF), which is why your loop iterates three times.
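The difference between the two results can be reproduced without ICU. The following is only a minimal standalone sketch: `widen_bytes` and `decode_utf8` are hypothetical helpers written for illustration, not ICU functions, and the decoder assumes well-formed UTF-8 and does no validation. Widening each byte on its own mimics what happens when the bytes are not interpreted as UTF-8, while decoding them as UTF-8 combines them into one code point.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: treat every byte as its own code point.
std::vector<char32_t> widen_bytes(const std::string &bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Hypothetical helper: decode UTF-8 into code points.
// Assumes well-formed input; no validation is performed.
std::vector<char32_t> decode_utf8(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = bytes[i];
        // Number of continuation bytes implied by the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < bytes.size(); ++k) {
            unsigned char cont = bytes[i + k];
            cp = (cp << 6) | (cont & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the bytes `"\xE6\x96\xAF"`, `widen_bytes` returns three values (0xE6, 0x96, 0xAF) and `decode_utf8` returns the single value 0x65AF.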

You need to change your constructor call to let UnicodeString know the data is UTF-8, e.g.:

UnicodeString testString(testChars, "utf-8");
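One thing to watch for once the loop sees a single code point: the question's loop appends a UChar32 to a std::string with `str += c`, which narrows the value to one char, so U+65AF would no longer round-trip to its three UTF-8 bytes. ICU can do the conversion for you (the question's second snippet already uses `toUTF8String`). The sketch below only illustrates what that encoding step involves; `append_utf8` is a hypothetical helper, not part of ICU, and it assumes the input is a valid code point.

```cpp
#include <string>

// Hypothetical helper: append one code point to a std::string as UTF-8.
// Assumes cp is a valid Unicode scalar value (no surrogate or range checks).
void append_utf8(std::string &out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling `append_utf8(str, 0x65AF)` appends the bytes E6 96 AF, whereas `str += c` with the same value would append only the low byte AF.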
Remy Lebeau answered Oct 21 '22 20:10