
Substring of a std::string in utf-8? C++11

I need the first N characters of a std::string that is assumed to be UTF-8 encoded. I learned the hard way that .substr does not work as expected here: it counts bytes, not characters.

Reference: My strings probably look like this: mission:\n\n1億2千万匹
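
To make the problem concrete, here is a minimal sketch of what goes wrong. The sample text is written as explicit UTF-8 bytes so the example does not depend on the source-file encoding; `sample_text` and `first_three_bytes` are hypothetical names used only for this illustration.

```cpp
#include <string>

// "1億2千万匹" spelled out as UTF-8 bytes: '1' and '2' take one byte each,
// every kanji takes three. The literal is split so that "\x84" is not
// followed directly by the hex digit '2'.
std::string sample_text()
{
    return "1\xE5\x84\x84" "2\xE5\x8D\x83\xE4\xB8\x87\xE5\x8C\xB9";
}

// What a naive call does: it takes the first three *bytes*, not characters.
std::string first_three_bytes()
{
    return sample_text().substr(0, 3);
}
```

The sample is 14 bytes long but holds only 6 characters, so `first_three_bytes()` yields `1` followed by the first two of the three bytes of 億, which is not valid UTF-8.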

Jonny asked Jun 23 '15 06:06



2 Answers

I found this code and am just about to try it out.

#include <string>

// Returns `leng` code points of `str`, starting at code point `start`.
// Pass std::string::npos as `leng` to take everything up to the end.
// Returns "" if an invalid UTF-8 lead byte is encountered.
std::string utf8_substr(const std::string& str, std::size_t start, std::size_t leng)
{
    if (leng == 0) { return ""; }
    // std::size_t avoids narrowing std::string::npos into an unsigned int.
    const std::size_t npos = std::string::npos;
    std::size_t min = npos, max = npos;  // byte offsets of the first byte and one past the last
    std::size_t i = 0, q = 0;            // i: byte index, q: code point index
    const std::size_t ix = str.length();
    for (; i < ix; i++, q++)
    {
        if (q == start) { min = i; }
        if (leng == npos || q <= start + leng) { max = i; }

        const unsigned char c = static_cast<unsigned char>(str[i]);
        if      (c <= 0x7F)          i += 0;  // 1-byte sequence (ASCII)
        else if ((c & 0xE0) == 0xC0) i += 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) i += 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) i += 3;  // 4-byte sequence
        else return "";                       // invalid lead byte
    }
    if (leng == npos || q <= start + leng) { max = i; }
    if (min == npos || max == npos) { return ""; }
    return str.substr(min, max - min);  // substr takes a byte count, not an end offset
}
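
One thing to keep in mind with the last line: std::string::substr takes a start position and a byte count, not two offsets, so a pair of byte offsets such as min and max has to be turned into a length (max - min) before the call. A minimal sketch of the difference, where `byte_range` is a hypothetical helper name and not part of the code above:

```cpp
#include <cstddef>
#include <string>

// Returns the bytes in the half-open range [first, last).
// substr expects (position, count), so the count is last - first.
std::string byte_range(const std::string& s, std::size_t first, std::size_t last)
{
    return s.substr(first, last - first);
}
```

For example, `byte_range("abcdef", 2, 4)` gives `cd`, whereas `std::string("abcdef").substr(2, 4)` gives `cdef`. When the start is 0 the two forms coincide, which is why passing an end offset as the count can go unnoticed when only prefixes are taken.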

Update: This worked well for my current issue. I had to combine it with a function that returns the length of a UTF-8 encoded std::string.
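
The length function itself is not shown here. A minimal sketch of one way to write it, assuming the input is valid UTF-8 (`utf8_length` is a hypothetical name): count every byte that is not a continuation byte of the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to be valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point.
std::size_t utf8_length(const std::string& str)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < str.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(str[i]);
        if ((c & 0xC0) != 0x80) { ++count; }
    }
    return count;
}
```

This counts code points, not grapheme clusters, so a base character followed by a combining mark is counted as two.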

My compiler spat out some warnings for this solution.

Jonny answered Oct 06 '22 07:10


You could use the Boost.Locale library to convert the UTF-8 string into a std::u32string (UTF-32, one element per code point), and then use the normal .substr() approach:

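#include <iostream>
#include <string>
#include <boost/locale.hpp>

// Convert UTF-32 (one char32_t per code point) back to UTF-8.
std::string ucs4_to_utf8(std::u32string const& in)
{
    return boost::locale::conv::utf_to_utf<char>(in);
}

// Convert UTF-8 to UTF-32 so that each element is one code point.
std::u32string utf8_to_ucs4(std::string const& in)
{
    return boost::locale::conv::utf_to_utf<char32_t>(in);
}

int main()
{
    // u8 literals are plain char arrays in C++11/14/17 (char8_t from C++20 on).
    std::string utf8 = u8"1億2千万匹";

    // substr now counts code points rather than bytes.
    std::u32string part = utf8_to_ucs4(utf8).substr(0, 3);

    std::cout << ucs4_to_utf8(part) << std::endl;
    // prints: 1億2
    return 0;
}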
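
If Boost is not available, the same convert, slice, convert-back idea can probably be sketched with the C++11 standard library alone. This is an assumption-laden alternative, not part of the answer above: `utf8_prefix` is a hypothetical name, std::wstring_convert and <codecvt> are deprecated since C++17, and some toolchains have had trouble with the char32_t specialisation, so treat it as a sketch rather than a recommendation.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Returns the first `count` code points of a UTF-8 string.
// from_bytes/to_bytes throw std::range_error on input they cannot convert.
std::string utf8_prefix(const std::string& utf8, std::size_t count)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    const std::u32string wide = conv.from_bytes(utf8);  // UTF-8 -> UTF-32
    return conv.to_bytes(wide.substr(0, count));        // slice, then back to UTF-8
}
```

Called with the sample string and a count of 3, this should give the same `1億2` prefix as the Boost version.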

Gunnar Klämke answered Oct 06 '22 06:10