Let's consider the following code listing the directory contents of the path given as the first argument to the program:
#include <filesystem>
#include <iostream>

int main(int argc, char **argv)
{
    if (argc != 2) {
        std::cerr << "Please specify a directory.\n";
        return 1;  // otherwise argv[1] would be used even when it is missing
    }
    for (auto& p : std::filesystem::directory_iterator(argv[1]))
        std::cout << p << '\n';
}
At first sight this seems very lean, portable and conforming to the C++ standard (please ignore that it does not catch exceptions if the directory does not exist).
However, there seem to be a few pitfalls. In particular, the C++ standard does not seem to mandate that the encoding of argv[1] matches that accepted by the std::filesystem::path constructors, nor does it seem to mandate that the encoding returned by std::filesystem::path::string() matches that accepted by std::cout.
Quite the opposite, the standard seems to introduce the new term "native encoding" which may be different from the execution character set encoding and is defined as:
The native encoding of a narrow character string is the operating system dependent current encoding for pathnames ([fs.class.path]).
From my reading of the standard, no conversion between encodings takes place if std::filesystem::path::value_type matches the char type of argv[1] (which is true on any POSIX system).
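To make that condition concrete, here is a minimal sketch that checks it at compile time. The names path_uses_narrow_chars and describe_char_argument are my own, chosen for illustration; only std::filesystem::path::value_type comes from the standard.

```cpp
#include <filesystem>
#include <type_traits>

// True when std::filesystem::path stores its native form as plain char.
// In that case a char* argument such as argv[1] already has the path's
// value type, so the library has no type conversion to perform.
constexpr bool path_uses_narrow_chars =
    std::is_same_v<std::filesystem::path::value_type, char>;

// Hypothetical helper: describes how a char* argument is treated.
constexpr const char* describe_char_argument() {
    return path_uses_narrow_chars
        ? "char input already has the native value type; stored unchanged"
        : "char input is converted from the native narrow encoding";
}
```

On a POSIX system I would expect path_uses_narrow_chars to be true, which is exactly the case where the bytes of argv[1] reach the filesystem library untouched, whatever encoding they happen to be in.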
This seems to allow, for example, a conforming implementation in which the execution character set encoding (and hence the encoding of argv[1] and that accepted by std::cout) is EBCDIC, but the encoding of strings accepted and provided by the filesystem library is ISO 8859-1, with no conversion performed between the two, making the filesystem library essentially useless. Worse yet, there is no way to figure out whether the two encodings are the same or not.
This can even get dangerous if you start writing utilities that delete files, and the file to be deleted, as given by argv[1], matches a completely different file when it is interpreted in the native encoding of the filesystem library.
Note that I'm not concerned about filesystems using different encodings than those used by programs. My concern is that the standard does not seem to mandate any conversion of those encodings.
The u8path() and u8string() functions are of no use here either, because the standard also provides no way to convert between UTF-8 and the execution character set encoding (used by argv[1] and std::cout).
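To illustrate where that leaves you, here is a small sketch built around u8string(). The helper name utf8_bytes is hypothetical. It copies through iterators so that it compiles both in C++17, where u8string() returns std::string, and in C++20, where it returns std::u8string.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: returns the UTF-8 form of a path as raw bytes.
// The element type of u8string() differs between C++17 (char) and
// C++20 (char8_t); copying element by element works with either.
std::string utf8_bytes(const std::filesystem::path& p) {
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}
```

The result is UTF-8 by definition, but writing it to std::cout is only correct if whatever reads the output also expects UTF-8, and the standard gives no way to check or enforce that.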
Is there any portable, encoding-agnostic and standard-compliant way to do this?
No, and this is not just theoretical.
On Windows systems, paths are UTF-16, and path::value_type is wchar_t, not the char you get from char** argv. This isn't a problem by itself: a path can be created from a char*. However, not every Windows file name can be expressed as a char*. Hence the program is unable to list the contents of some directories, namely those whose names cannot be expressed as a char*.
Now you'd think that Linux would be better. That's actually not entirely the case - the bytes you get for a filename can depend on whether you entered them on a keyboard or via TAB completion!
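I have not spelled out the cause above. One common way this can happen, and this is an assumption for the sake of illustration, is Unicode normalization: the same visible name can be encoded with a precomposed character or with a base letter plus a combining mark. The sketch below uses made-up names (composed, decomposed, same_file_name) to show that the two spellings are different byte strings.

```cpp
#include <string>

// "café" with é as one precomposed code point, U+00E9, in UTF-8.
const std::string composed = "caf\xC3\xA9";
// "café" with é as 'e' followed by combining acute accent U+0301, in UTF-8.
const std::string decomposed = "cafe\xCC\x81";

// Hypothetical helper: compares names the way a byte-oriented
// filesystem would, with no normalization applied.
bool same_file_name(const std::string& a, const std::string& b) {
    return a == b;
}
```

Both strings look identical when printed on a UTF-8 terminal, yet a program that passes one of them to the filesystem library will not find a file that was created under the other.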