How to measure the correct size of non-ASCII characters?

In the following program, I'm trying to measure the length of a string with non-ASCII characters.

But I'm not sure why size() doesn't return the correct length for the string with non-ASCII characters.

#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo on Wandbox.

asked Oct 26 '17 by msc

1 Answer

std::string::size returns the length in bytes (char units), not the number of characters. Your second string uses a Unicode encoding (UTF-8 here, given the 18-byte output for 6 code points), so a single character may take several bytes. Note that the same applies to std::wstring::size, since it too depends on the encoding: it returns the number of wide-chars, not actual characters. If the wide encoding is UTF-32 that count matches the number of code points, but with UTF-16 characters outside the BMP take two wide-chars (more in this answer).
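To see that size() counts code units of the string's element type rather than characters, compare the same text stored as UTF-8, UTF-16 and UTF-32 (a small sketch; it assumes the source file is saved as UTF-8 so the plain literal is UTF-8 encoded):

#include <iostream>
#include <string>

int main()
{
    std::string    s8  = "इंडिया";  // UTF-8: 3 bytes per code point here
    std::u16string s16 = u"इंडिया"; // UTF-16: these code points all fit in one unit
    std::u32string s32 = U"इंडिया"; // UTF-32: always one unit per code point

    std::cout << s8.size()  << '\n'       // 18 (bytes)
              << s16.size() << '\n'       // 6 (16-bit units)
              << s32.size() << std::endl; // 6 (code points)
}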

To measure the actual length (the number of code points, which is usually what is meant by "characters" here) you need to know the encoding in order to separate, and therefore count, the characters correctly. This answer may be helpful for UTF-8, for example (although the method it uses is deprecated in C++17).
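For illustration, here is a minimal sketch of that kind of conversion-based approach; it relies on std::wstring_convert and std::codecvt_utf8, which are the facilities deprecated in C++17 (whether the linked answer uses exactly this is an assumption):

#include <codecvt>
#include <locale>
#include <string>

// Decode UTF-8 into UTF-32 and count the resulting elements.
// std::wstring_convert/std::codecvt_utf8 are deprecated since C++17,
// but still available.
std::size_t utf8_codepoints(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(s).size(); // one char32_t per code point
}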

Another option for UTF-8 is to count the bytes that start a code point, i.e. every byte that is not a continuation byte (credit to this other answer):

// Counts UTF-8 code points: in UTF-8, continuation bytes have the
// bit pattern 10xxxxxx, so every byte whose top two bits are not 10
// starts a new code point.
int utf8_length(const std::string& s) {
  int len = 0;
  for (unsigned char c : s)
      len += (c & 0xc0) != 0x80; // count non-continuation bytes
  return len;
}
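Applied to the string from the question (assuming the definition above is in scope), this yields 6, the number of code points. Note that for scripts like Devanagari this can still differ from the number of user-perceived characters, since several code points may combine into one grapheme cluster:

#include <iostream>
#include <string>

int main()
{
    std::string s2 = "इंडिया";
    // 6 code points (इ + anusvara + ड + vowel sign + य + vowel sign),
    // even though it renders as fewer visible characters.
    std::cout << utf8_length(s2) << std::endl; // prints 6
}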
answered Oct 23 '22 by cbuchart