 

How will std::u8string be different from std::string?

If I have a string:

std::string s = u8"你好";

and in C++20,

std::u8string s = u8"你好";

how will std::u8string be different from std::string?

asked Jun 03 '19 03:06 by user963241





1 Answer

The only difference between u8string and string is that one is basic_string instantiated with char8_t and the other with char, so the real question is how char8_t-based strings differ from char-based strings.
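As a rough illustration of that type-level difference, the sketch below checks the underlying character types at compile time. It is guarded by the `__cpp_char8_t` and `__cpp_lib_char8_t` feature-test macros so that it also compiles before C++20. The names `utf8_string` and `hello_utf8` are hypothetical, introduced only for this example.

```cpp
#include <string>
#include <type_traits>

// std::string is basic_string<char> in every standard.
static_assert(std::is_same_v<std::string, std::basic_string<char>>);

#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
// C++20: char8_t is a distinct type, and u8 literals are made of it.
static_assert(!std::is_same_v<char8_t, char>);
static_assert(std::is_same_v<std::u8string, std::basic_string<char8_t>>);
static_assert(std::is_same_v<decltype(u8"x"), const char8_t (&)[2]>);
using utf8_string = std::u8string;  // hypothetical alias for this sketch
#else
// Before C++20: u8 literals are plain char arrays.
static_assert(std::is_same_v<decltype(u8"x"), const char (&)[2]>);
using utf8_string = std::string;    // hypothetical alias for this sketch
#endif

// Hypothetical helper: "你好" written with escapes so the source file's
// own encoding does not matter.
inline utf8_string hello_utf8() {
    return u8"\u4f60\u597d";
}
```

One consequence for the question's first line: under C++20, std::string s = u8"你好"; no longer compiles, because a char8_t array cannot initialize a char-based string.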

It really comes down to this: type-based encoding.

Any char-based string (char*, char[], string, etc.) may be encoded in UTF-8. But then again, it may not. You could develop your code under the assumption that every char* equivalent is UTF-8 encoded. And you could write a u8 in front of every string literal and/or otherwise ensure they're properly encoded. But:

  1. Other people's code may not agree. So you can't use any library that might return char*s that don't use UTF-8 encoding.

  2. You might accidentally violate your own precepts. After all, char not_utf8[] = "你好"; is conditionally supported C++. The encoding of that char[] will be the compiler's narrow encoding... whatever that is. It may be UTF-8 on some compilers and something else on others.

  3. You can't tell other people's code (or even other people on your team) that this is what you're doing. That is, your API cannot declare that a particular char* is UTF-8-encoded. This has to be something the user assumes or has otherwise read in your documentation, rather than something they see in code.

Note that none of these problems exist for users of UTF-16 or UTF-32. If you use a char16_t-based string, all of these problems go away. If other people's code returns a char16_t string, you know what they're doing. If they return something else, then you know that those things probably aren't UTF-16. Your UTF-16-based code can interop with theirs. If you write an API that returns a char16_t-based string, everyone using your code can see from the type of the string what encoding it is. And this is guaranteed to be a compile error: char16_t not_utf16[] = "你好";
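A minimal sketch of what "seeing the encoding in the type" can look like, using char16_t since it works in any standard from C++11 on. The `Encoding` enum and the `describe` overloads are hypothetical names made up for this illustration.

```cpp
#include <string>

// Hypothetical tag used only to report which overload was chosen.
enum class Encoding { Unspecified, Utf16 };

// A char-based parameter says nothing about the encoding of its contents.
inline Encoding describe(const std::string&) { return Encoding::Unspecified; }

// A char16_t-based parameter documents the expected encoding in its type.
inline Encoding describe(const std::u16string&) { return Encoding::Utf16; }

// A u"" literal is an array of char16_t: two code units plus the terminator.
static_assert(sizeof(u"\u4f60\u597d") / sizeof(char16_t) == 3, "2 units + NUL");

// The mismatch below is rejected by the compiler, so it stays commented out:
// char16_t not_utf16[] = "\u4f60\u597d";  // narrow literal, char16_t array
```

Overload resolution picks the right function from the literal prefix alone, with no documentation lookup needed.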

Now yes, none of this is actually enforced. Any particular char16_t string could have any values in it, even ones that are illegal in UTF-16. But char16_t is a type for which the default assumption is a specific encoding. Given that, if you present a string of this type that isn't UTF-16 encoded, it would not be unreasonable to treat that as a mistake or bad faith on the user's part, that is, as a contract violation.
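To make the "illegal values" case concrete, here is one way such a contract could be checked at a boundary. This is only a sketch: `is_well_formed_utf16` is a hypothetical helper, not a standard function, and it checks only for unpaired surrogates.

```cpp
#include <cstddef>
#include <string>

// Returns true when every surrogate code unit in s belongs to a valid
// high/low pair, i.e. s contains no unpaired surrogates.
inline bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {              // high surrogate
            if (i + 1 == s.size()) return false;       // nothing follows it
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                       // skip the low surrogate
        } else if (c >= 0xDC00 && c <= 0xDFFF) {       // stray low surrogate
            return false;
        }
    }
    return true;
}
```

The type system cannot run this check for you; it only establishes which check the caller is entitled to expect has been satisfied.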

We can see how C++ has been affected by the lack of similar type-based facilities for UTF-8. Consider filesystem::path. It can take strings in any Unicode encoding. For UTF-16 and UTF-32, path's constructor takes char16_t- and char32_t-based strings. But you cannot pass a UTF-8 string to path's constructor; the char-based constructor assumes that the encoding is the implementation-defined narrow encoding, not UTF-8. So instead, you have to use filesystem::u8path, a separate function that returns a path constructed from a UTF-8-encoded string.

What's worse is that if you try to pass a UTF-8 encoded char-based string to path's constructor... it compiles fine. Despite being at best non-portable, it may just appear to work.

char8_t, and all of its accoutrements like u8string, exist to give UTF-8 users the same type-based guarantees that users of the other UTF encodings get. In C++20, filesystem::path gets overloads for char8_t-based strings, and u8path is deprecated.
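The two ways of building a path from UTF-8 text can be sketched side by side. The function name `make_utf8_path` is hypothetical, and the feature-test macros are there only so the same file compiles as either C++17 or C++20.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: builds the path "你好.txt" from UTF-8 text.
inline fs::path make_utf8_path() {
#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
    // C++20: the u8 literal is char8_t-based, so the constructor can tell
    // from the type that the input is UTF-8.
    return fs::path(u8"\u4f60\u597d.txt");
#else
    // C++17: the u8 literal is char-based, so the encoding has to be
    // stated by calling u8path instead of the constructor.
    return fs::u8path(u8"\u4f60\u597d.txt");
#endif
}
```

Either branch should yield the same path; converting it back with u8string() gives the original ten UTF-8 code units.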

And, as an added bonus, char8_t doesn't have special aliasing language around it. So an API that takes char8_t-based strings is certainly an API that takes a character array, rather than an arbitrary byte array.

answered Nov 10 '22 12:11 by Nicol Bolas
