From this answer I learned that in C++17 we can open std::fstream using a UTF-8 path via std::filesystem::u8path. But in C++20 this function is deprecated, and we are supposed to pass const char8_t* to the std::filesystem::path constructor instead.
Here comes the problem: although we can legally convert (via reinterpret_cast) any pointer to const char*, we can't go the other way: from const char* to e.g. const char8_t* (it would break strict aliasing rules). So if we have some external API returning a char-based UTF-8 representation of the filename (e.g. from a library written in C), we can't safely convert the pointer to a char8_t-based one.
So, how are we supposed to convert such a char-based view of a UTF-8 string to a char8_t-based view of it?
Disclaimer: I'm the author of the P0482 proposal that introduced char8_t and deprecated u8path.
Your observations are correct; it is not permissible to use reinterpret_cast to produce a char8_t pointer to a sequence of char objects. This is discussed further at https://stackoverflow.com/a/57453713/11634221.
Though std::filesystem::u8path has been deprecated in C++20, there are no plans for its imminent removal; you can continue to use it. Further, P1423 corrects an unintended consequence of the changes in P0482 and permits it to be called with ranges of both char and char8_t in C++20. As far as I'm aware, no implementors have annotated std::filesystem::u8path as deprecated (I don't know if any plan to do so).
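As a minimal sketch of the "keep using u8path" option, the char-based UTF-8 bytes can be handed to std::filesystem::u8path directly. The helper name path_via_u8path is hypothetical, and depending on the toolchain version this call may produce a deprecation warning in C++20 mode.

```cpp
#include <filesystem>
#include <string_view>

// Hypothetical helper: builds a path from char-based UTF-8 bytes,
// e.g. a file name returned by a C library.
inline std::filesystem::path path_via_u8path(std::string_view utf8) {
    // u8path interprets the char sequence as UTF-8 and converts it
    // to the native path encoding of the platform.
    return std::filesystem::u8path(utf8);
}
```

A call such as path_via_u8path(name_from_c_api) then yields a path that can be passed to the std::fstream constructor, where name_from_c_api stands for whatever const char* the external API returned.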
There is no (well-formed) way to produce a char8_t pointer-based view of a sequence of char. It is possible to write a range/iterator adapter that, internally, converts the individual char values to char8_t on iterator dereference. Such an adapter could satisfy the C++17 and C++20 random access iterator requirements for a non-mutable iterator (it can't satisfy the requirements for a mutable iterator because the dereference operation wouldn't be able to provide an lvalue, nor could it satisfy the requirements for a contiguous iterator). Such an adapter would suffice for calls to the std::filesystem::path constructors that accept ranges. Hmm, this might be a useful enough adapter to add to https://github.com/tahonermann/char8_t-remediation.
An alternative to a view over the underlying char data is, of course, to copy it, but I can appreciate why doing so might be considered undesirable (we already tend to do a lot of copying when working with std::filesystem::path).
From this character types reference about char8_t:
It has the same size, signedness, and alignment as unsigned char (and, therefore, the same size and alignment as char and signed char), but is a distinct type.
Because it's a distinct type, you cannot convert from const char* to const char8_t* without breaking strict aliasing. But for all practical purposes, since char8_t is basically an unsigned char, you can use reinterpret_cast to convert the pointer. That is formally undefined behavior, though it tends to work in practice.
For proper correctness either use char8_t to begin with, or copy the original characters into a char8_t buffer (or std::u8string).