
absent std::u8string in C++11

Why does C++11 provide std::u16string and std::u32string but not std::u8string? Do we need to implement UTF-8 encoding ourselves, or use additional libraries?

Sergio asked Mar 20 '17 09:03



2 Answers

C++20 adds char8_t and std::u8string. According to the proposal, the rationale is:

UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text. The inability to infer an encoding for narrow characters and strings limits design possibilities and hinders the production of elegant interfaces that work seamlessly in generic code. Library authors must choose to limit encoding support, design interfaces that require users to explicitly specify encodings, or provide distinct interfaces for, at least, the implementation defined execution and UTF-8 encodings.

Whether char is a signed or unsigned type is implementation defined and implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text due to the necessity of having to rely on conversions to unsigned types in order to correctly process leading and continuation code units of multi-byte encoded code points.
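To make the signedness point concrete, here is a minimal sketch of classifying UTF-8 code units stored in plain char. The helper names (is_continuation, is_lead, count_code_points) are made up for illustration and are not part of any standard API. The cast to unsigned char is the important part: where char is signed, every byte at or above 0x80 is negative, so a direct comparison such as c >= 0xC2 would never be true.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only.

// True for UTF-8 continuation code units (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// True for code units that can start a multi-byte sequence (0xC2..0xF4).
inline bool is_lead(char c) {
    const unsigned char u = static_cast<unsigned char>(c);  // avoid negative values
    return u >= 0xC2 && u <= 0xF4;
}

// Counts code points by counting every code unit that is not a continuation.
// Assumes the input is already well-formed UTF-8.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if (!is_continuation(c)) ++n;
    }
    return n;
}

inline void check_code_units() {
    const std::string cafe = "Caf\xC3\xA9";  // "Café", with é encoded as C3 A9
    assert(cafe.size() == 5);                // five code units
    assert(count_code_points(cafe) == 4);    // four code points
}
```

With char8_t, which is an unsigned type, the casts become unnecessary.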

The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t fundamental type and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable generic interfaces that work with all five of the standard mandated text encodings in a consistent manner.
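As a rough illustration of the overloading argument in the quoted rationale, the sketch below picks an encoding tag from the code unit type. The Encoding enum and encoding_of are hypothetical names, not a standard API. The char8_t overload is guarded by the __cpp_char8_t feature-test macro, so without C++20 a u8"" literal falls into the const char* overload and its encoding cannot be inferred from the type.

```cpp
#include <cassert>

// Hypothetical names for illustration; not a standard or library API.
enum class Encoding { NarrowUnknown, Wide, Utf8, Utf16, Utf32 };

// const char* may hold execution-encoded or UTF-8 text; the type cannot tell.
inline Encoding encoding_of(const char*)     { return Encoding::NarrowUnknown; }
inline Encoding encoding_of(const wchar_t*)  { return Encoding::Wide; }
inline Encoding encoding_of(const char16_t*) { return Encoding::Utf16; }
inline Encoding encoding_of(const char32_t*) { return Encoding::Utf32; }

#if defined(__cpp_char8_t)  // C++20: u8"" literals have type const char8_t[N]
inline Encoding encoding_of(const char8_t*)  { return Encoding::Utf8; }
#endif

inline void check_overloads() {
    assert(encoding_of("x")  == Encoding::NarrowUnknown);
    assert(encoding_of(L"x") == Encoding::Wide);
    assert(encoding_of(u"x") == Encoding::Utf16);
    assert(encoding_of(U"x") == Encoding::Utf32);
#if defined(__cpp_char8_t)
    assert(encoding_of(u8"x") == Encoding::Utf8);
#else
    // Before C++20, u8"x" is a const char[2], indistinguishable from "x".
    assert(encoding_of(u8"x") == Encoding::NarrowUnknown);
#endif
}
```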

lz96 answered Oct 24 '22 04:10



Because the C and C++ standards committees do not yet care enough about valid UTF-8 sequences and their comparison. For them strcmp((char*)utf8, (char*)other) is enough, even though it reports two strings as different when they would be the same after normalization, and even when one of them is invalid UTF-8.
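For reference, a validity check is not much code. The following is a minimal sketch, not a tuned implementation; is_valid_utf8 is a hypothetical helper name. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                    // stray continuation or invalid lead
        if (n - i < len) return false;        // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

inline void check_validity() {
    assert(is_valid_utf8("Caf\xC3\xA9"));  // well-formed
    assert(!is_valid_utf8("\xC0\xAF"));    // overlong encoding of '/'
}
```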

Nor do they care about proper identifiers, that is, UTF-8 sequences that must be identifiable, such as pathnames. For them "Café" is not the same as "Café" when the bytes differ: "e\u0301" (e followed by a combining acute accent) vs. "\u00e9" (precomposed é). For u8ident that is wrong, for u8string it's arguable. At least validity needs to be checked, and the normalization result can be cached. It's a rare case.
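A small sketch of the "Café" case, with the UTF-8 bytes written out explicitly so the example does not depend on the source file encoding. The names cafe_nfc, cafe_nfd and bytes_equal are made up for illustration. The standard library offers no normalization, so treating these two as equal would need an external library such as ICU.

```cpp
#include <cassert>
#include <cstring>

// "Café" with precomposed U+00E9 (NFC form): é is the bytes C3 A9.
const char cafe_nfc[] = "Caf\xC3\xA9";
// "Café" as 'e' followed by combining U+0301 (NFD form): bytes 65 CC 81.
const char cafe_nfd[] = "Cafe\xCC\x81";

// Byte-wise equality, which is all strcmp (and std::string's operator==) provides.
inline bool bytes_equal(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

inline void check_cafe() {
    // Both render as "Café", but the byte sequences differ in length and content.
    assert(sizeof(cafe_nfc) == 6);  // 5 bytes plus the terminating NUL
    assert(sizeof(cafe_nfd) == 7);  // 6 bytes plus the terminating NUL
    assert(!bytes_equal(cafe_nfc, cafe_nfd));
}
```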

Not even coreutils handles this properly yet, and most filesystems treat names as binary, which is a security risk.

See e.g. https://crashcourse.housegordon.org/coreutils-multibyte-support.html or http://perl11.github.io/blog/foldcase.html

rurban answered Oct 24 '22 06:10
