
How to safely convert const char* to const char8_t* in C++20?

From this answer I learned that in C++17 we can open a std::fstream using a UTF-8 path via std::filesystem::u8path. But in C++20 this function is deprecated, and we are supposed to pass a const char8_t* to the std::filesystem::path constructor instead.

Here comes the problem: although we can legally convert (via reinterpret_cast) any pointer to const char*, we can't go the other way, from const char* to e.g. const char8_t* (accessing the data through the result would break the strict aliasing rules). So if we have some external API returning a char-based UTF-8 representation of the filename (e.g. from a library written in C), we can't safely convert the pointer to a char8_t-based one.

So, how are we supposed to convert such a char-based view of a UTF-8 string to a char8_t-based view of it?

Ruslan asked Aug 22 '19 06:08


2 Answers

Disclaimer: I'm the author of the P0482 proposal that introduced char8_t and deprecated u8path.

Your observations are correct; it is not permissible to use reinterpret_cast to produce a char8_t pointer to a sequence of char objects. This is discussed further at https://stackoverflow.com/a/57453713/11634221.

Though std::filesystem::u8path has been deprecated in C++20, there are no plans for its imminent removal; you can continue to use it. Further, P1423 corrects an unintended consequence of the changes in P0482 and permits it to be called with ranges of both char and char8_t in C++20. As far as I'm aware, no implementors have annotated std::filesystem::u8path as deprecated (I don't know if any plan to do so).

There is no (well-formed) way to produce a char8_t-pointer-based view of a sequence of char. It is possible to write a range/iterator adapter that, internally, converts the individual char values to char8_t on iterator dereference. Such an adapter could satisfy the C++17 and C++20 random access iterator requirements for a non-mutable iterator (it can't satisfy the requirements for a mutable iterator because the dereference operation wouldn't be able to provide an lvalue, nor could it satisfy the requirements for a contiguous iterator). Such an adapter would suffice for calls to the std::filesystem::path constructors that accept ranges. Hmm, this might be a useful enough adapter to add to https://github.com/tahonermann/char8_t-remediation.

An alternative to a view over the underlying char data is, of course, to copy it, but I can appreciate why doing so might be considered undesirable (we already tend to do a lot of copying when working with std::filesystem::path).

Tom Honermann answered Nov 15 '22 16:11


From this character types reference about char8_t:

It has the same size, signedness, and alignment as unsigned char (and, therefore, the same size and alignment as char and signed char), but is a distinct type.

Because it's a distinct type, you cannot convert from const char* to const char8_t* and read through the result without breaking strict aliasing. But for all practical purposes, since char8_t is basically an unsigned char, you can use reinterpret_cast to convert the pointer. It's technically undefined behavior, but it will typically work.

For proper correctness either use char8_t to begin with, or copy the original characters into a char8_t buffer (or std::u8string).

Some programmer dude answered Nov 15 '22 14:11