How will std::u8string differ from std::string?

Tags: c++, string, unicode, c++20

If I have a string:

std::string s = u8"你好";

and in C++20,

std::u8string s = u8"你好";

How will std::u8string differ from std::string?

asked Jun 03 '19 03:06

user963241

1 Answer

Since the difference between u8string and string is that one is templated on char8_t and the other on char, the real question is what is the difference between using char8_t-based strings vs. char-based strings.

It really comes down to this: type-based encoding.

Any char-based string (char*, char[], string, etc) may be encoded in UTF-8. But then again, it may not. You could develop your code under an assumption that every char* equivalent will be UTF-8 encoded. And you could write a u8 in front of every string literal and/or otherwise ensure they're properly encoded. But:

1. Other people's code may not agree. So you can't use any library that might return char*s that don't use UTF-8 encoding.
2. You might accidentally violate your own precepts. After all, char not_utf8[] = "你好"; is conditionally supported C++. The encoding of that char[] will be the compiler's narrow encoding... whatever that is. It may be UTF-8 on some compilers and something else on others.
3. You can't tell other people's code (or even other people on your team) that this is what you're doing. That is, your API cannot declare that a particular char* is UTF-8-encoded. This has to be something the user assumes or has otherwise read in your documentation, rather than something they see in code.

Note that none of these problems exist for users of UTF-16 or UTF-32. If you use a char16_t-based string, all of these problems go away. If other people's code returns a char16_t string, you know what they're doing. If they return something else, then you know that those things probably aren't UTF-16. Your UTF-16-based code can interop with theirs. If you write an API that returns a char16_t-based string, everyone using your code can see from the type of the string what encoding it is. And this is guaranteed to be a compile error: char16_t not_utf16[] = "你好";

Now yes, there is no guarantee of any of these things. Any particular char16_t string could have any values in it, even those that are illegal for UTF-16. But char16_t represents a type for which the default assumption is a specific encoding. Given that, if you present a string with this type that isn't UTF-16 encoded, it would not be unreasonable to consider this a mistake or perfidy on the user's part: a contract violation.
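The UTF-16 situation described above can be sketched as follows. This is an illustration only; `greeting_utf16` and `example_utf16` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// A char16_t-based string cannot be built from narrow (char) text,
// so the type system rejects the accidental mix-up.
static_assert(!std::is_constructible_v<std::u16string, const char*>);
static_assert(std::is_constructible_v<std::u16string, const char16_t*>);

// Hypothetical API: the return type alone tells callers the text is UTF-16.
std::u16string greeting_utf16() {
    return u"\u4F60\u597D";  // "你好": two code points, two UTF-16 code units
}

void example_utf16() {
    assert(greeting_utf16().size() == 2);
}
```

As the answer notes, nothing stops someone from stuffing invalid values into a `std::u16string`; the type only sets the default assumption.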

We can see how C++ has been impacted by lacking similar, type-based facilities for UTF-8. Consider filesystem::path. It can take strings in any Unicode encoding. For UTF-16/32, path's constructor takes char16/32_t-based strings. But you cannot pass a UTF-8 string to path's constructor; the char-based constructor assumes that the encoding is the implementation-defined narrow encoding, not UTF-8. So instead, you have to employ filesystem::u8path, which is a separate function that returns a path, constructed from a UTF-8-encoded string.

What's worse is that if you try to pass a UTF-8 encoded char-based string to path's constructor... it compiles fine. Despite being at best non-portable, it may just appear to work.

char8_t, and all of its accoutrements like u8string, exist to give UTF-8 users the same power that users of the other UTF encodings get. In C++20, filesystem::path gains overloads for char8_t-based strings, and u8path is deprecated.

And, as an added bonus, char8_t doesn't have special aliasing language around it. So an API that takes char8_t-based strings is certainly an API that takes a character array, rather than an arbitrary byte array.

answered Nov 10 '22 12:11

Nicol Bolas
