My std::string is UTF-8 encoded, so str.length() returns the number of bytes rather than the number of characters.
I found this information but I'm not sure how I can use it to do this:
The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
How can I find the actual length of a UTF-8 encoded std::string? Thanks
std::string has length() and size() member functions. Both return the number of char elements (bytes) in the string, not the number of characters. For traditional C-style strings, strlen() likewise counts the bytes before the terminating null.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.
UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
UTF-8 is a variable-width character encoding standard that uses between one and four eight-bit bytes to represent all valid Unicode code points.
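To make the "one to four bytes" rule concrete, here is a minimal sketch that reads the length of a sequence from its first byte. The name utf8_sequence_length is hypothetical, and the function only inspects the leading bit pattern; it does not reject overlong or out-of-range encodings.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 0 for a continuation byte (10xxxxxx) or a pattern that cannot start a sequence.
// Only the leading bits are checked; this is not a full validity check.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}
```

For example, the byte 0xE3 (the first byte of most katakana) matches 1110xxxx, so it starts a three-byte sequence.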
Count all first-bytes (the ones that don't match 10xxxxxx).
// s is a const char* pointing at a null-terminated UTF-8 string
int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
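The same first-byte counting can be wrapped for std::string directly. This is a sketch, assuming the string already holds valid UTF-8; utf8_length is a hypothetical name, not a standard function. It counts code points, so a character built from several code points (a base letter plus a combining accent, for instance) counts more than once.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points, assuming `s` holds valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Iterating over the std::string instead of a char pointer also means embedded null bytes do not cut the count short.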
C++ knows nothing about encodings, so you can't expect to use a standard function to do this.
The standard library indeed does acknowledge the existence of character encodings, in the form of locales. If your system supports a locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output is that there were 13 char-values and 7 characters. And all without any reference to the internal representation of UTF-8 character codes or having to use 3rd party libraries.
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 2) return 1;  // expects the string as the first argument
    string str(argv[1]);
    size_t strLen = str.length();
    cout << "Length (char-values): " << strLen << '\n';
    setlocale(LC_ALL, "en_US.utf8");  // assumes this locale is installed
    size_t u = 0;
    const char *c_str = str.c_str();
    size_t charCount = 0;
    while (u < strLen)
    {
        int n = mblen(&c_str[u], strLen - u);
        if (n <= 0) break;  // invalid or incomplete sequence: stop instead of looping forever
        u += n;
        charCount += 1;
    }
    cout << "Length (characters): " << charCount << endl;
}