
C++ substring multi byte characters

I have a std::string that contains some characters that span multiple bytes.

When I take a substring of this string, the output is not valid because, of course, each of these characters is counted as 2 characters. In my opinion I should be using a wstring instead, because it would store each of these characters as one element instead of several.

So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.

Is there a good way to convert a string to a wstring, merging each special character into 1 element instead of 2?

Thanks

W. Goeman asked Dec 09 '22 23:12



2 Answers

A simpler version, based on the solution Marcelo Cantos provided in "Getting the actual length of a UTF-8 encoded std::string?":

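#include <string>

// Returns the first maxLength code points of a UTF-8 encoded string.
// Assumes originalString is valid UTF-8.
std::string substr(const std::string& originalString, int maxLength)
{
    int len = 0;                           // code points seen so far
    std::string::size_type byteCount = 0;  // bytes that belong to the result

    for (; byteCount < originalString.size(); ++byteCount)
    {
        // Continuation bytes look like 10xxxxxx; any other byte starts a new code point.
        if ((originalString[byteCount] & 0xc0) != 0x80)
            len += 1;

        // This byte starts code point number maxLength + 1, so cut just before it.
        if (len > maxLength)
            break;
    }

    return originalString.substr(0, byteCount);
}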
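The function above only takes a prefix. If a substring from the middle is needed, the same continuation-byte test can be extended with a start index. The following is a minimal sketch, not part of the original answer: the name utf8_substr and its parameters are made up for illustration, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns `count` code points of `s`, starting at
// code point index `start`. Assumes `s` is valid UTF-8.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count)
{
    std::size_t seen = 0;              // lead bytes (code points) passed so far
    std::size_t beginByte = s.size();  // byte offset where the result starts
    std::size_t endByte = s.size();    // byte offset one past the result

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Skip continuation bytes (10xxxxxx); they never start a code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            continue;

        if (seen == start)
            beginByte = i;

        // Stop at the lead byte of the first code point past the requested range.
        if (seen >= start && seen - start == count)
        {
            endByte = i;
            break;
        }
        ++seen;
    }

    return s.substr(beginByte, endByte - beginByte);
}
```

With the input "h\xc3\xa9llo" (the UTF-8 bytes of "héllo"), utf8_substr(input, 1, 2) yields the three bytes of "él". A start index past the end yields an empty string, and a count that runs past the end yields the rest of the string.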
eugene answered Dec 26 '22 19:12



A std::string object is not a string of characters; it's a string of bytes. It has no notion of what's called "encoding" at all. The same goes for std::wstring, except that it's a string of wchar_t values, which are 16 bits wide on Windows and typically 32 bits wide on Linux and macOS.
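To make the byte-versus-character distinction concrete, here is a small sketch. The helper name utf8_length is hypothetical, and it assumes the string holds valid UTF-8. It counts code points by skipping continuation bytes, while std::string::size() keeps reporting bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is not a continuation byte (10xxxxxx).
// Assumes `s` is valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
    {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "h\xc3\xa9llo" (the UTF-8 bytes of "héllo"), size() returns 6 while utf8_length returns 5, because "é" occupies two bytes.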

In order to perform operations on your text which require addressing distinct characters (as is the case when you want to take the substring, for instance) you need to know what encoding is used for your std::string object.

UPDATE: Now that you have clarified that your input string is UTF-8 encoded, you still need to decide on an encoding to use for your output std::wstring. UTF-16 comes to mind, but it really depends on what the API you will pass the std::wstring objects to expects. Assuming that UTF-16 is acceptable, you have various choices:

  1. On Windows, you can use the MultiByteToWideChar function; no extra dependencies required.
  2. The UTF8-CPP library claims to provide a lightweight solution for dealing with UTF-* encoded strings. Never tried it myself, but I keep hearing good things about it.
  3. On Linux systems, using the libiconv library is quite common.
  4. If you need to deal with all sorts of crazy encodings and want the definitive, full-featured library as far as encodings go, look at ICU.
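For illustration only, the kind of conversion these options perform can be sketched without any dependency. This is not how any of the libraries above is implemented: the name utf8_to_wstring is made up, and the sketch does not reject overlong encodings or encoded surrogates, which a production library should. It emits UTF-16 surrogate pairs where wchar_t is 16 bits wide and one element per code point where it is 32 bits wide.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into a std::wstring.
// Simplification: overlong forms and encoded surrogates are not rejected.
std::wstring utf8_to_wstring(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;

    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;        // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) >= 4 || cp < 0x10000)
        {
            // Fits in a single wchar_t element.
            out.push_back(static_cast<wchar_t>(cp));
        }
        else
        {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

After this conversion, characters such as "é" occupy a single element, so std::wstring::substr cuts at the intended positions. With a 16-bit wchar_t, code points above U+FFFF still take two elements.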
Frerich Raabe answered Dec 26 '22 18:12
