
Getting the actual length of a UTF-8 encoded std::string?

Tags:

c++

algorithm

My std::string is UTF-8 encoded, so str.length() returns the number of bytes rather than the number of characters.

I found this information but I'm not sure how I can use it to do this:

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

   0x00000000 - 0x0000007F:        0xxxxxxx
   0x00000080 - 0x000007FF:        110xxxxx 10xxxxxx
   0x00000800 - 0x0000FFFF:        1110xxxx 10xxxxxx 10xxxxxx
   0x00010000 - 0x001FFFFF:        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

How can I find the actual length of a UTF-8 encoded std::string? Thanks

jmasterx asked Oct 31 '10 12:10


People also ask

How do I find the length of a std::string?

The C++ std::string class has length() and size() member functions; both return the number of char elements in the string. For traditional C-style strings, use strlen().

How many bytes is a string in UTF-8?

UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.

Does std::string support UTF-8?

UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Is UTF-8 variable length?

UTF-8 is a variable-width character encoding standard that uses between one and four eight-bit bytes to represent all valid Unicode code points.


2 Answers

Count all first-bytes (the ones that don't match 10xxxxxx).

    int len = 0;
    while (*s) len += (*s++ & 0xc0) != 0x80;
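The same idea can be applied directly to a std::string instead of a null-terminated char pointer. The following is only a sketch: the function name utf8_length is made up for illustration, and it assumes the string already holds well-formed UTF-8. It counts code points, which is not always what a reader perceives as one character (combining marks are counted separately).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is NOT a continuation byte (10xxxxxx).
// Assumes `s` contains well-formed UTF-8; no validation is performed.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Iterating over the std::string rather than testing for a terminating zero means embedded null bytes are counted as characters instead of ending the loop early.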
Marcelo Cantos answered Sep 24 '22 23:09


"C++ knows nothing about encodings, so you can't expect to use a standard function to do this."

The standard library does acknowledge the existence of character encodings, in the form of locales. If your system supports a suitable locale, it is easy to use the standard library to compute the length of a string. The example code below assumes your system supports the locale en_US.utf8. If I compile the code and run it as "./a.out ソニーSony", the output reports 13 char-values and 7 characters, all without any reference to the internal representation of UTF-8 character codes and without third-party libraries.

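    #include <clocale>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    using namespace std;

    int main(int argc, char *argv[])
    {
      if (argc < 2)
      {
        cerr << "usage: " << argv[0] << " <string>\n";
        return 1;
      }
      string str(argv[1]);
      size_t strLen = str.length();
      cout << "Length (char-values): " << strLen << '\n';
      // mblen() decodes according to the current locale, so select a UTF-8 one.
      if (setlocale(LC_ALL, "en_US.utf8") == nullptr)
      {
        cerr << "locale en_US.utf8 is not available\n";
        return 1;
      }
      size_t u = 0;
      const char *c_str = str.c_str();
      size_t charCount = 0;
      while (u < strLen)
      {
        int n = mblen(&c_str[u], strLen - u);
        if (n <= 0)  // -1: invalid or incomplete sequence; 0: null byte
        {
          cerr << "invalid multibyte sequence at byte " << u << '\n';
          return 1;
        }
        u += n;  // advance past the whole multibyte character
        charCount += 1;
      }
      cout << "Length (characters): " << charCount << endl;
    }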
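If no UTF-8 locale is available, the stepping loop used with mblen above can be approximated without a locale by reading the sequence length from the lead byte, following the table in the question. This is a rough sketch, not part of the answer's method: the names utf8_sequence_length and utf8_count_chars are made up for illustration, and continuation bytes are not validated.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the sequence length implied by a lead byte,
// per the table in the question, or 0 if the byte cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Hypothetical helper: counts characters by jumping from lead byte to
// lead byte. Returns false if a lead byte is invalid or a sequence is
// truncated; the continuation bytes themselves are not checked.
bool utf8_count_chars(const std::string& s, std::size_t& count) {
    count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t step =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (step == 0 || step > s.size() - i) return false;
        i += step;
        ++count;
    }
    return true;
}
```

Unlike a bare count of lead bytes, this version reports failure when the input starts a sequence it cannot finish, which mirrors the way mblen signals an error by returning -1.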
user2781185 answered Sep 23 '22 23:09