C++ iterate or split UTF-8 string into array of symbols?

Question

I am looking for a way to iterate over a UTF-8 string, or to split it into an array of UTF-8 symbols, that does not depend on the platform or on third-party libraries.

Please post a code snippet.

Mark Wilkins · Accepted Answer

If I understand correctly, you want to find the start of each UTF-8 character. If so, parsing them is fairly straightforward (interpreting them is a different matter), because the number of octets in each character is defined by the RFC (RFC 3629):

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, if lb has the first octet of a UTF-8 character, I think the following would determine the number of octets involved.

unsigned char lb;                 // first octet of a UTF-8 character

if (( lb & 0x80 ) == 0 )          // 0xxx xxxx: lead bit is zero, must be a single ASCII character
   printf( "1 octet\n" );
else if (( lb & 0xE0 ) == 0xC0 )  // 110x xxxx
   printf( "2 octets\n" );
else if (( lb & 0xF0 ) == 0xE0 )  // 1110 xxxx
   printf( "3 octets\n" );
else if (( lb & 0xF8 ) == 0xF0 )  // 1111 0xxx
   printf( "4 octets\n" );
else                              // continuation byte (10xx xxxx) or invalid lead byte
   printf( "Unrecognized lead byte (%02x)\n", lb );

Ultimately, though, you are going to be much better off using an existing library as suggested in another post. The above code might categorize the characters according to octets, but it doesn't help "do" anything with them once that is finished.
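For the original request (an array of symbols with no library at all), the lead-byte test above can be turned into a small splitter. This is only a minimal sketch: the names utf8_sequence_length and split_utf8 are made up for illustration, and the code assumes the input is mostly well-formed. It does not validate continuation bytes, overlong encodings or surrogates; a byte that cannot start a sequence, or a sequence cut off by the end of the string, is simply emitted as a one-byte entry.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of octets in the sequence introduced by
// lead byte `lb`, or 0 if `lb` cannot start a sequence.
inline std::size_t utf8_sequence_length(unsigned char lb)
{
    if ((lb & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lb & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lb & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lb & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                           // continuation byte or invalid
}

// Hypothetical helper: split `text` into one std::string per encoded
// character. Continuation bytes are NOT validated (assumption: the
// input is mostly well-formed UTF-8).
inline std::vector<std::string> split_utf8(const std::string& text)
{
    std::vector<std::string> symbols;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(text[i]));
        if (len == 0 || i + len > text.size())
            len = 1;  // bad lead byte or truncated tail: keep the byte alone
        symbols.push_back(text.substr(i, len));
        i += len;
    }
    return symbols;
}
```

Each element of the returned vector holds the raw octets of one character, so split_utf8(text).size() gives the number of encoded characters rather than the number of bytes. Anything beyond that (normalization, grapheme clusters, case mapping) is where a real library becomes worthwhile.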

topright gamedev · Answer

Solved using the tiny, platform-independent UTF8 CPP library:

    const char* str_i = text.c_str();       // iterator over the utf-8 string
    const char* end = str_i + text.size();  // end iterator (the terminating NUL is not included)

    while ( str_i < end )
    {
        // get the 32 bit code of a utf-8 symbol and advance str_i past it;
        // the checked utf8::next throws an exception on invalid input
        uint32_t code = utf8::next(str_i, end);

        unsigned char symbol[5] = {0};      // up to 4 octets plus a terminating NUL
        utf8::append(code, symbol);         // encode the code back into symbol

        // ... do something with symbol
    }
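If adding the library is not an option, the decoding step can be approximated by hand from the bit layout in the RFC table. The function below is a rough sketch, not UTF8 CPP code: decode_utf8_symbol is a hypothetical name, it expects exactly one encoded character per call, and it returns U+FFFD for input it cannot decode. It does not reject overlong encodings or surrogate code points, which a real library would.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of UTF8 CPP): decode a string holding
// exactly one UTF-8 encoded character into its code point.
// Returns 0xFFFD (the replacement character) if the lead byte is
// invalid, the length does not match, or a continuation byte is malformed.
inline std::uint32_t decode_utf8_symbol(const std::string& s)
{
    if (s.empty()) return 0xFFFD;

    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::uint32_t code = 0;
    std::size_t len = 0;

    // The lead byte gives the sequence length and the top payload bits.
    if      ((b0 & 0x80) == 0x00) { code = b0;        len = 1; }
    else if ((b0 & 0xE0) == 0xC0) { code = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { code = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { code = b0 & 0x07; len = 4; }
    else return 0xFFFD;

    if (s.size() != len) return 0xFFFD;

    // Every continuation byte is 10xxxxxx and carries 6 payload bits.
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;
        code = (code << 6) | (b & 0x3F);
    }
    return code;
}
```

For instance, the two octets 0xC3 0xA9 decode to 0xE9 (é), and the three octets 0xE2 0x82 0xAC decode to 0x20AC (the euro sign).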

Nemanja Trifunovic · Answer

UTF8 CPP is exactly what you want.

Kirill V. Lyadvinsky · Answer

Try the ICU library.
