std::u16string, std::u32string, std::string, length(), size(), codepoints and characters

Q: What is u16string?

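Each element of a u16string is a char16_t code unit, which typically occupies two bytes in memory. So when you ask for the length of a u16string, each 16-bit code unit is counted once. That count is neither the byte count nor the character count: a character outside the Basic Multilingual Plane is stored as two code units (a surrogate pair).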

Q: How many bytes is std::string?

std::string::size() returns the number of char elements in the string, which equals its length in bytes. In the example for std::string::size, the size of str is 22 bytes.

I'm happy to see std::u16string and std::u32string in C++11, but I'm wondering why there is no std::u8string to handle the UTF-8 case. I'm under the impression that std::string is intended for UTF-8, but it doesn't seem to handle it very well. What I mean is, doesn't std::string::length() still return the size of the string's buffer rather than the number of characters in the string?

So, how is the length() method of the standard strings defined for the new C++11 classes? Does it return the size of the string's buffer, the number of codepoints, or the number of characters (assuming a surrogate pair is two code points but one character; please correct me if I'm wrong)?

And what about size(); isn't it equal to length()? See http://en.cppreference.com/w/cpp/string/basic_string/length for the source of my confusion.

So, I guess, my fundamental question is how does one use std::string, std::u16string, and std::u32string and properly distinguish between buffer size, number of codepoints, and number of characters? If you use the standard iterators, are you iterating over bytes, codepoints, or characters?

asked Sep 03 '12 16:09

Verax

1 Answer

u16string and u32string are not "new C++11 classes". They're just typedefs of std::basic_string for the char16_t and char32_t types.

length is always equal to size for any basic_string. It is the number of T's in the string, where T is the template type for the basic_string.

basic_string is not Unicode-aware in any way, shape, or form. It has no concept of codepoints, graphemes, Unicode characters, Unicode normalization, or anything of the kind. It is simply an ordered sequence of Ts. The only thing that is Unicode-aware about u16string and u32string is that they use the element types of u"" and U"" literals. Thus, they can store Unicode-encoded strings, but they do nothing that requires knowledge of said encoding.

Iterators iterate over elements of T, not "bytes, codepoints, or characters". If T is char16_t, then it will iterate over char16_ts. If the string is UTF-16-encoded, then it is iterating over UTF-16 code units, not Unicode codepoints or bytes.

answered Sep 21 '22 15:09

Nicol Bolas

Tags: c++, unicode
