I'm happy to see std::u16string and std::u32string in C++11, but I'm wondering why there is no std::u8string to handle the UTF-8 case. I'm under the impression that std::string is intended for UTF-8, but it doesn't seem to do it very well. What I mean is, doesn't std::string::length() still return the size of the string's buffer rather than the number of characters in the string?
So, how is the length() method of the standard strings defined for the new C++11 classes? Do they return the size of the string's buffer, the number of codepoints, or the number of characters (assuming a surrogate pair is 2 code points but one character; please correct me if I'm wrong)?
And what about size(); isn't it equal to length()?
See http://en.cppreference.com/w/cpp/string/basic_string/length for the source of my confusion.
So, I guess, my fundamental question is how does one use std::string, std::u16string, and std::u32string and properly distinguish between buffer size, number of codepoints, and number of characters? If you use the standard iterators, are you iterating over bytes, codepoints, or characters?
In libc++, a std::string object is 24 bytes in size, yet it can hold strings of up to 22 bytes(!!) with no allocation. To achieve this, libc++ uses a neat trick: the size of the string is not saved as-is but in a special way. If the string is short (< 23 bytes), it stores size() * 2.
Each char16_t element occupies two bytes in memory. So when you ask for the length of a u16string, every two bytes are counted as one element. They are, after all, two-byte (16-bit) code units.
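To make the relationship between length and bytes concrete, here is a minimal sketch. The helper name byte_size is made up for illustration, and it assumes the usual platforms where sizeof(char16_t) is 2 and sizeof(char32_t) is 4.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the character data of any
// basic_string, not counting the null terminator or unused capacity.
template <class CharT>
std::size_t byte_size(const std::basic_string<CharT>& s) {
    // size() counts elements of type CharT, so scale by the element width.
    return s.size() * sizeof(CharT);
}
```

For example, byte_size(std::u16string(u"abc")) is 6 on such a platform, while the string's length() is 3.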
u16string and u32string are not "new C++11 classes". They're just typedefs of std::basic_string for the char16_t and char32_t types.
length is always equal to size for any basic_string. It is the number of T's in the string, where T is the template type for the basic_string.
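As a quick illustration of "the number of T's", here is a sketch that stores the same single code point, U+1D11E (musical G clef), in all three string types. The struct and function names are made up for this example, and the UTF-8 bytes are written as hex escapes so the snippet does not depend on how the compiler types u8"" literals.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for the three lengths.
struct Lengths {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// The same code point, U+1D11E, encoded three ways.
Lengths clef_lengths() {
    const std::string    s8  = "\xF0\x9D\x84\x9E";  // four UTF-8 code units
    const std::u16string s16 = u"\U0001D11E";       // a surrogate pair
    const std::u32string s32 = U"\U0001D11E";       // one UTF-32 code unit
    // length() and size() are interchangeable; both count elements of T.
    return {s8.length(), s16.length(), s32.size()};
}
```

The three lengths come out as 4, 2, and 1, even though every string holds exactly one code point.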
basic_string is not Unicode aware in any way, shape, or form. It has no concept of codepoints, graphemes, Unicode characters, Unicode normalization, or anything of the kind. It is simply an ordered sequence of Ts. The only thing that is Unicode-aware about u16string and u32string is that they use the character types of u"" and U"" literals. Thus, they can store Unicode-encoded strings, but they do nothing that requires knowledge of said encoding.
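One way to see the lack of Unicode awareness is normalization. The sketch below (function names are made up for illustration) builds "é" twice, once precomposed and once as "e" plus a combining acute accent. The two render identically, but basic_string compares them element by element and knows nothing about canonical equivalence.

```cpp
#include <string>

// U+00E9: LATIN SMALL LETTER E WITH ACUTE, a single code point.
std::u32string precomposed() { return U"\u00E9"; }

// U+0065 followed by U+0301 (COMBINING ACUTE ACCENT): two code points
// that are canonically equivalent to U+00E9.
std::u32string decomposed() { return U"e\u0301"; }

// operator== compares sequences of char32_t values, nothing more.
bool same_elements(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Here same_elements(precomposed(), decomposed()) is false, and the sizes are 1 and 2. Treating the two as equal would need a Unicode library that implements normalization; the standard string classes do not do it.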
Iterators iterate over elements of T, not "bytes, codepoints, or characters". If T is char16_t, then it will iterate over char16_ts. If the string is UTF-16-encoded, then it is iterating over UTF-16 code units, not Unicode codepoints or bytes.
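If you do want a codepoint count, you have to derive it from the code units yourself (or use a Unicode library). Below is a minimal sketch with a made-up function name, count_code_points. It assumes the input is well-formed UTF-8 or UTF-16 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Well-formed UTF-8: every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Well-formed UTF-16: every code unit that is not a low (trailing)
// surrogate, 0xDC00..0xDFFF, starts a new code point.
std::size_t count_code_points(const std::u16string& utf16) {
    std::size_t n = 0;
    for (char16_t c : utf16) {
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    }
    return n;
}
```

For the text "a", "é", U+1D11E, the UTF-8 string has size() 7 and the UTF-16 string has size() 4, but both functions return 3. Note that this still counts codepoints, not user-perceived characters (graphemes).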