Getting the actual length of a UTF-8 encoded std::string?

Tags:

algorithm

My std::string is UTF-8 encoded, so str.length() returns the number of bytes rather than the number of characters.

I found this information, but I'm not sure how to use it to get the character count:

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:
    0x00000000 - 0x0000007F:    0xxxxxxx
    0x00000080 - 0x000007FF:    110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF:    1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x001FFFFF:    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
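One way to read the table in code: the high bits of the first byte of a sequence tell you how many bytes the sequence occupies. The sketch below is only an illustration of that rule. The function name `utf8_sequence_length` is made up for this example, and it does not validate the continuation bytes or reject overlong encodings.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its
// first byte alone. Returns 0 for a continuation byte (10xxxxxx) or for a
// byte that cannot start any sequence in the table (11111xxx).
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // not a valid first byte
}
```

Stepping through a string by this length and counting the steps gives the number of code points, provided the input is valid UTF-8.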

How can I find the actual length of a UTF-8 encoded std::string? Thanks

asked Oct 31 '10 12:10

2 Answers

Count all first bytes, i.e. the bytes that don't match 10xxxxxx. Every code point has exactly one such byte, so the count is the number of code points.

    const char *s = str.c_str();
    int len = 0;
    while (*s) len += (*s++ & 0xc0) != 0x80;
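The same idea can be wrapped in a small function that takes the std::string directly. This is a minimal sketch, assuming the input is valid UTF-8; the name `utf8_length` is invented for the example. Iterating over the string's own range, rather than stopping at a null terminator, also handles strings with embedded '\0' bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte that
// is not a continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& str) {
    return static_cast<std::size_t>(
        std::count_if(str.begin(), str.end(), [](char c) {
            // Cast first: plain char may be signed on some platforms.
            return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
        }));
}
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining accent counts as two.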

answered Sep 24 '22 23:09

Marcelo Cantos

"C++ knows nothing about encodings, so you can't expect to use a standard function to do this."

The standard library does in fact acknowledge the existence of character encodings, in the form of locales. If your system supports a UTF-8 locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output reports 13 char-values and 7 characters (the three katakana take 3 bytes each, plus 4 ASCII bytes). And all without any reference to the internal representation of UTF-8 character codes or having to use third-party libraries.

    #include <clocale>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    using namespace std;

    int main(int argc, char *argv[])
    {
      if (argc < 2) return 1;  // expects the string as the first argument
      string str(argv[1]);
      size_t strLen = str.length();
      cout << "Length (char-values): " << strLen << '\n';
      if (!setlocale(LC_ALL, "en_US.utf8")) return 1;  // locale not available
      size_t u = 0;
      const char *c_str = str.c_str();
      size_t charCount = 0;
      while (u < strLen)
      {
        // mblen returns the byte length of the next character, or -1 if invalid
        int n = mblen(&c_str[u], strLen - u);
        if (n <= 0) break;  // stop instead of looping forever on bad input
        u += n;
        charCount += 1;
      }
      cout << "Length (characters): " << charCount << endl;
    }

answered Sep 23 '22 23:09

user2781185
