I need to get a substring of the first N characters in a std::string assumed to be utf8.
I learned the hard way that .substr
does not work... as... expected.
Reference: My strings probably look like this: mission:\n\n1億2千万匹
On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent. For both, size tracks the number of code units instead of the number of code points, or grapheme clusters.
string find in C++ String find is used to find the first occurrence of sub-string in the specified string being called upon. It returns the index of the first occurrence of the substring in the string from given starting position. The default value of starting position is 0.
In C++, a part of a string is referred to as a substring. substr is a C++ function for obtaining a substring (). There are two parameters in this function: pos and len. The pos parameter defines the substring's start location, while the len indicates the number of characters in the substring.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
I found this code and am just about to try it out.
std::string utf8_substr(const std::string& str, unsigned int start, unsigned int leng)
{
if (leng==0) { return ""; }
unsigned int c, i, ix, q, min=std::string::npos, max=std::string::npos;
for (q=0, i=0, ix=str.length(); i < ix; i++, q++)
{
if (q==start){ min=i; }
if (q<=start+leng || leng==std::string::npos){ max=i; }
c = (unsigned char) str[i];
if (
//c>=0 &&
c<=127) i+=0;
else if ((c & 0xE0) == 0xC0) i+=1;
else if ((c & 0xF0) == 0xE0) i+=2;
else if ((c & 0xF8) == 0xF0) i+=3;
//else if (($c & 0xFC) == 0xF8) i+=4; // 111110bb //byte 5, unnecessary in 4 byte UTF-8
//else if (($c & 0xFE) == 0xFC) i+=5; // 1111110b //byte 6, unnecessary in 4 byte UTF-8
else return "";//invalid utf8
}
if (q<=start+leng || leng==std::string::npos){ max=i; }
if (min==std::string::npos || max==std::string::npos) { return ""; }
return str.substr(min,max);
}
Update: This worked well for my current issue. I had to mix it with a get-length-of-utf8encoded-stdsstring function.
This solution had some warnings spat at it by my compiler:
You could use the boost/locale library to convert the utf8 string into a wstring. And then use the normal .substr() approach:
#include <iostream>
#include <boost/locale.hpp>
std::string ucs4_to_utf8(std::u32string const& in)
{
return boost::locale::conv::utf_to_utf<char>(in);
}
std::u32string utf8_to_ucs4(std::string const& in)
{
return boost::locale::conv::utf_to_utf<char32_t>(in);
}
int main(){
std::string utf8 = u8"1億2千万匹";
std::u32string part = utf8_to_ucs4(utf8).substr(0,3);
std::cout<<ucs4_to_utf8(part)<<std::endl;
// prints : 1億2
return 0;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With