In the following program, I'm trying to measure the length of a string that contains non-ASCII characters, but I don't understand why size() doesn't report the correct length for the non-ASCII string.
#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}
Output:
Size of Hello is 5
Size of इंडिया is 18
Live demo on Wandbox.
std::string::size returns the length in bytes, not the number of characters. Your second string uses a Unicode encoding (almost certainly UTF-8 here), so a single character can take several bytes. Note that the same caveat applies to std::wstring::size, because it also depends on the encoding: it returns the number of wide chars, not actual characters. With 32-bit wide chars holding UTF-32 the count matches the code points, but with UTF-16 a character outside the Basic Multilingual Plane still occupies two wide chars (more in this answer).
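As a quick illustration (my own example, not part of the original answer; it assumes the narrow execution character set is UTF-8, as on the Wandbox demo), size() always counts code units of whatever encoding the string type stores:

#include <iostream>
#include <string>

int main()
{
    // The same six Devanagari code points stored in three encodings.
    std::string    s8  = "इंडिया";   // UTF-8 here: 3 bytes per code point
    std::u16string s16 = u"इंडिया";  // UTF-16: 1 code unit each (all are in the BMP)
    std::u32string s32 = U"इंडिया";  // UTF-32: always 1 code unit per code point

    std::cout << s8.size()  << '\n'   // 18
              << s16.size() << '\n'   // 6
              << s32.size() << '\n';  // 6
}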
To measure the actual length (the number of code points rather than bytes) you need to know the encoding, so that you can split, and therefore count, the characters correctly. This answer may be helpful for UTF-8, for example (although the conversion facility it uses is deprecated since C++17).
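For instance, one way to do that for UTF-8 (a sketch of mine, not necessarily the exact method in the linked answer) is to convert the bytes to a std::u32string with std::wstring_convert and take its size; this is precisely the facility that C++17 deprecates:

#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Number of Unicode code points in a UTF-8 encoded string.
// std::wstring_convert / std::codecvt_utf8 are deprecated since C++17,
// but still available in practice.
std::size_t utf8_codepoints(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(s).size(); // length of the resulting UTF-32 string
}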
Another option for UTF-8 is to count the lead bytes, i.e. the bytes that start a code point (credit to this other answer):
#include <string>

// Counts UTF-8 lead bytes, i.e. code points: continuation bytes
// all have the bit pattern 10xxxxxx (0x80..0xBF) and are skipped.
int utf8_length(const std::string& s)
{
    int len = 0;
    for (unsigned char c : s)
        len += (c & 0xc0) != 0x80;
    return len;
}
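Applied to the question's string (a usage example of mine; it assumes the utf8_length function above is in scope), it counts the code points rather than the bytes:

#include <iostream>
#include <string>

int main()
{
    std::string s2 = "इंडिया";
    std::cout << utf8_length(s2) << '\n'; // prints 6, whereas s2.size() prints 18
}

Keep in mind that 6 is the number of code points; the string renders as three user-perceived characters (the grapheme clusters इं, डि, या), and counting those requires a full Unicode library such as ICU.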