
Incorrect behaviour of size() and at() in string class

Tags:

c++

I've got this code:

string test("żaba");

cout << "Word: " << test << endl;
cout << "Length: " << test.size() << endl;
cout << "Letter: " << test.at(0) << endl;

The output is strange:

Word: żaba
Length: 5
Letter: �

I expected the length to be 4 and the letter to be "ż".

How can I correct this code to work properly?

Daniel Gadawski asked May 13 '12 09:05


2 Answers

std::string on non-Windows platforms is usually used to store UTF-8 strings (UTF-8 being the default encoding on most sane operating systems this side of 2010), but it is a "dumb" container in the sense that it doesn't know or care anything about the bytes you're storing. It'll work for reading, storing, and writing, but not for character-level string manipulation.

You need to use the excellent and well-maintained IBM ICU: International Components for Unicode. It's a C/C++ library for *nix or Windows into which a ton of research has gone to provide a culture-aware string library, including case-insensitive string comparison that's both fast and accurate.

Another good project, and one that is easier for C++ developers to switch to, is UTF8-CPP.

Mahmoud Al-Qudsi answered Sep 28 '22 16:09


Your question fails to mention encodings so I’m going to take a stab in the dark and say that this is the reason.

First course of action: read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

After that, it should become clear that such a thing as a “naked string” doesn’t exist – every string is encoded somehow. In your case, it looks very much like you are using a UTF-8-encoded string with diacritics, in which case, yes, the length of the string is (correctly) reported as 5 (note 1), and at(0) returns only the first byte of the two-byte sequence that encodes “ż”, which is not printable on its own.


1) Note that string::size counts bytes (= chars), not logical characters or even code points.

Konrad Rudolph answered Sep 28 '22 15:09