In the following program, I'm trying to measure the length of a string that contains non-ASCII characters, but I don't understand why size() doesn't report the correct length for the non-ASCII string.
#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}
Output:
Size of Hello is 5
Size of इंडिया is 18
Live demo on Wandbox.
std::string::size returns the length in bytes, not the number of characters. Your second string uses a Unicode encoding (almost certainly UTF-8 here), so a single character can take several bytes. Note that the same caveat applies to std::wstring::size, because it also depends on the encoding: it returns the number of wide chars, not actual characters. With 32-bit wide chars holding UTF-32 the count matches the code points, but with UTF-16 a character outside the Basic Multilingual Plane still occupies two wide chars (more in this answer).
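As a quick illustration (my own example, not part of the original answer; it assumes the narrow execution character set is UTF-8, as on the Wandbox demo), size() always counts code units of whatever encoding the string type stores:

#include <iostream>
#include <string>

int main()
{
    // The same six Devanagari code points stored in three encodings.
    std::string    s8  = "इंडिया";   // UTF-8 here: 3 bytes per code point
    std::u16string s16 = u"इंडिया";  // UTF-16: 1 code unit each (all are in the BMP)
    std::u32string s32 = U"इंडिया";  // UTF-32: always 1 code unit per code point

    std::cout << s8.size()  << '\n'   // 18
              << s16.size() << '\n'   // 6
              << s32.size() << '\n';  // 6
}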
To measure the actual length (the number of code points rather than bytes) you need to know the encoding, so that you can split, and therefore count, the characters correctly. This answer may be helpful for UTF-8, for example (although the conversion facility it uses is deprecated since C++17).
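For instance, one way to do that for UTF-8 (a sketch of mine, not necessarily the exact method in the linked answer) is to convert the bytes to a std::u32string with std::wstring_convert and take its size; this is precisely the facility that C++17 deprecates:

#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Number of Unicode code points in a UTF-8 encoded string.
// std::wstring_convert / std::codecvt_utf8 are deprecated since C++17,
// but still available in practice.
std::size_t utf8_codepoints(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(s).size(); // length of the resulting UTF-32 string
}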
Another option for UTF-8 is to count the lead bytes, i.e. the bytes that start a code point (credit to this other answer):
#include <string>

// Counts UTF-8 lead bytes, i.e. code points: continuation bytes
// all have the bit pattern 10xxxxxx (0x80..0xBF) and are skipped.
int utf8_length(const std::string& s)
{
    int len = 0;
    for (unsigned char c : s)
        len += (c & 0xc0) != 0x80;
    return len;
}
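Applied to the question's string (a usage example of mine; it assumes the utf8_length function above is in scope), it counts the code points rather than the bytes:

#include <iostream>
#include <string>

int main()
{
    std::string s2 = "इंडिया";
    std::cout << utf8_length(s2) << '\n'; // prints 6, whereas s2.size() prints 18
}

Keep in mind that 6 is the number of code points; the string renders as three user-perceived characters (the grapheme clusters इं, डि, या), and counting those requires a full Unicode library such as ICU.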