
C++ iterate utf-8 string with mixed length of characters

Tags: c++, string, utf-8

I need to loop over a utf-8 string and get each character of the string. There might be different types of characters in the string, e.g. numbers with the length of one byte, Chinese characters with the length of three bytes, etc.

I looked at this post, and its code does 80% of the job. However, when the string has 3-byte Chinese characters before 1-byte digits, the digits are also treated as 3 bytes long and are printed as 1**, where * is gibberish.

To give an example, if the string is '今天周五123', the result will be:

今
天
周
五
1**
2**
3**

where * is gibberish. However if the string is '123今天周五', the numbers will print out fine.

The minimally adapted code from the above mentioned post is copied here:

#include <cstring>   // strlen
#include <iostream>
#include <string>
#include "utf8.h"

using namespace std;

int main() {    
    string text = "今天周五123";

    char* str = (char*)text.c_str();    // utf-8 string
    char* str_i = str;                  // string iterator
    char* end = str+strlen(str)+1;      // end iterator

    unsigned char symbol[5] = {0,0,0,0,0};

    cout << symbol << endl;

    do
    {
        uint32_t code = utf8::next(str_i, end); // get 32 bit code of a utf-8 symbol
        if (code == 0)
            continue;

        cout << "utf 32 code:" << code << endl;

        utf8::append(code, symbol); // initialize array `symbol`

        cout << symbol << endl;

    }
    while ( str_i < end );

    return 0;
}

Can anyone help me here? I am new to C++, and although I checked the documentation of utf8 cpp, I still have no idea where the problem is. I think the library was created to handle exactly this case, UTF-8 text whose characters have different byte lengths, so there should be a way to do this. I have been struggling with this for two days.

asked Oct 15 '16 03:10 by Hai


1 Answer

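utf8::append writes only the bytes of the current code point and does not null-terminate them, so after a 3-byte character the last two bytes stay in symbol and are printed behind every following 1-byte digit. Insert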

memset(symbol, 0, sizeof(symbol));

before

utf8::append(code, symbol);  
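To see why the reset matters without needing the library, here is a minimal sketch of the same buffer reuse. `encode_utf8` is a hypothetical stand-in for `utf8::append`, not the library's code; it is only assumed to behave the same way in the respect that matters here: it writes the bytes of one code point and adds no terminating `'\0'`. `encode_stale` and `encode_clean` are likewise illustrative names.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for utf8::append: writes the UTF-8 bytes of `cp`
// to `out` and returns a pointer just past the last byte written.
// It does not write a terminating '\0'. Assumes `cp` is a valid code point.
unsigned char* encode_utf8(std::uint32_t cp, unsigned char* out) {
    if (cp < 0x80) {
        *out++ = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<unsigned char>(0xC0 | (cp >> 6));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<unsigned char>(0xE0 | (cp >> 12));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<unsigned char>(0xF0 | (cp >> 18));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Reuses the buffer without clearing it: bytes left by a longer, earlier
// character survive behind a shorter one.
std::string encode_stale(std::uint32_t first, std::uint32_t second) {
    unsigned char symbol[5] = {0, 0, 0, 0, 0};
    encode_utf8(first, symbol);
    encode_utf8(second, symbol);             // no reset in between
    return reinterpret_cast<char*>(symbol);  // reads up to the first '\0'
}

// Clears the buffer before the second write, as the memset fix does.
std::string encode_clean(std::uint32_t first, std::uint32_t second) {
    unsigned char symbol[5] = {0, 0, 0, 0, 0};
    encode_utf8(first, symbol);
    std::memset(symbol, 0, sizeof(symbol));
    encode_utf8(second, symbol);
    return reinterpret_cast<char*>(symbol);
}
```

With '五' (U+4E94, three bytes) followed by '1', `encode_stale` returns a 3-byte string: the digit plus the two leftover bytes, which is the 1** from the question. `encode_clean` returns just "1". An alternative to clearing the whole buffer would be to write a single `'\0'` at the position the encoder returns.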

If this for some reason still doesn't work, or if you want to get rid of the lib, recognizing codepoints is not that complicated:

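string text = "今天周五123";
for(size_t i = 0; i < text.length();)
{
    // The high bits of the lead byte give the length of the sequence.
    int cplen = 1;                                  // 0xxxxxxx: 1 byte (ASCII)
    if((text[i] & 0xf8) == 0xf0) cplen = 4;         // 11110xxx: 4 bytes
    else if((text[i] & 0xf0) == 0xe0) cplen = 3;    // 1110xxxx: 3 bytes
    else if((text[i] & 0xe0) == 0xc0) cplen = 2;    // 110xxxxx: 2 bytes
    if((i + cplen) > text.length()) cplen = 1;      // truncated sequence: fall back to 1 byte

    cout << text.substr(i, cplen) << endl;
    i += cplen;
}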
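If the numeric code of each character is also needed (the question prints it as "utf 32 code"), the same lead-byte tests can be extended to collect the payload bits. The following is only a sketch: `decode_utf8` is a name made up here, and it assumes well-formed input rather than validating it the way a library would.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes a UTF-8 string into code points, using the same lead-byte tests
// as the loop above. Assumes well-formed input: continuation bytes are not
// validated, and a truncated sequence at the end is dropped.
std::vector<char32_t> decode_utf8(const std::string& text) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < text.size();) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        char32_t cp = lead;                 // 1-byte (ASCII) case
        if ((lead & 0xF8) == 0xF0)      { len = 4; cp = lead & 0x07; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        if (i + len > text.size()) break;   // truncated sequence at the end
        for (std::size_t k = 1; k < len; ++k) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For '五' followed by '1' this yields the two values 0x4E94 and 0x31. For untrusted input, a validating decoder is the safer choice.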

With both solutions, however, be aware that some glyphs are made of several code points (for example a base letter followed by a combining mark), and that some code points can't be printed on their own.
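A small sketch of the multi-code-point case, assuming well-formed UTF-8: the letter "é" can be stored as one code point or as two, and a per-code-point loop splits the second form into "e" and a bare combining accent. `count_code_points`, `precomposed` and `decomposed` are illustrative names.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a UTF-8
// continuation byte (10xxxxxx). Assumes well-formed input.
std::size_t count_code_points(const std::string& text) {
    std::size_t n = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Two spellings of what a reader sees as the same single glyph:
const std::string precomposed = "\xC3\xA9";   // U+00E9
const std::string decomposed  = "e\xCC\x81";  // U+0065 followed by U+0301 (combining acute)
```

`count_code_points(precomposed)` is 1 while `count_code_points(decomposed)` is 2, although both render as one character. Splitting text into user-visible characters requires grapheme cluster segmentation, which neither solution above attempts.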

answered Oct 13 '22 02:10 by deviantfan