We are representing paths as boost::filesystem::path, but in some cases other APIs expect them as const char * (e.g., to open a DB file with SQLite).
From the documentation, path::value_type is wchar_t under Windows. As far as I know, wchar_t on Windows is 2 bytes wide and holds UTF-16 code units.
There is a string() native observer that returns a std::string, while stating:
If string_type is a different type than String, conversion is performed by cvt.
cvt is initialized to a default constructed codecvt. What is the behaviour of this default constructed codecvt?
There is this forum entry that recommends using an instance of utf8_codecvt_facet as the cvt value to portably convert to UTF-8. But it seems that this codecvt actually converts between UTF-8 and UCS-4, not UTF-16.
What would be the best way (and if possible portable) to obtain an std::string representation of a path, making sure to convert from the right wchar_t encoding when necessary?
cvt is initialized to a default constructed codecvt. What is the behaviour of this default constructed codecvt?
It uses the default locale for conversion to the locale-specific multi-byte character set. On Windows this locale normally corresponds to the regional settings in the control panel.
What would be the best way (and if possible portable) to obtain an std::string representation of a path, making sure to convert from the right wchar_t encoding when necessary?
The C++11 standard introduced std::codecvt_utf8_utf16. Although it is deprecated as of C++17, according to this paper it will be available "until a suitable replacement is standardized".
To use this facet, call the static function:
boost::filesystem::path::imbue(
    std::locale( std::locale(), new std::codecvt_utf8_utf16<wchar_t>() ) );
After that all calls to path::string() will convert from UTF-16 to UTF-8.
Another way is to use std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > to do the conversion only in some cases.
Complete example code:
#include <boost/filesystem.hpp>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

void print_hex( std::string const& path );

int main()
{
    // Create UTF-16 path (on Windows) that contains the characters "ÄÖÜ".
    boost::filesystem::path path( L"\u00c4\u00d6\u00dc" );

    // Convert path using the default locale and print result.
    // On a system with German default locale, this prints "0xc4 0xd6 0xdc".
    // On a system with a different locale, this might fail.
    print_hex( path.string() );

    // Set locale for conversion from UTF-16 to UTF-8.
    boost::filesystem::path::imbue(
        std::locale( std::locale(), new std::codecvt_utf8_utf16<wchar_t>() ) );

    // Because we changed the locale, path::string() now converts the path to UTF-8.
    // This always prints the UTF-8 bytes "0xc3 0x84 0xc3 0x96 0xc3 0x9c".
    print_hex( path.string() );

    // Another option is to convert only case-by-case, by explicitly using a code converter.
    // This always prints the UTF-8 bytes "0xc3 0x84 0xc3 0x96 0xc3 0x9c".
    std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > cvt;
    print_hex( cvt.to_bytes( path.wstring() ) );
}

// Print each byte of the string as a hexadecimal value.
void print_hex( std::string const& path )
{
    for( char c : path )
    {
        std::cout << std::hex << "0x" << static_cast<unsigned>(static_cast<unsigned char>( c )) << ' ';
    }
    std::cout << '\n';
}
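On the question's concern about UTF-16 versus UCS-4: the following is a minimal sketch, without Boost, of how std::codecvt_utf8_utf16 treats characters outside the Basic Multilingual Plane, which UTF-16 stores as a surrogate pair. The helper names utf16_to_utf8 and utf8_to_utf16 are hypothetical and exist only for this illustration. The sketch uses char16_t instead of wchar_t so that it does not depend on the platform's wchar_t size, and it assumes a standard library that still ships <codecvt> (deprecation warnings are possible from C++17 on).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper for illustration only (not part of Boost or the
// standard library): converts a UTF-16 string to UTF-8.
// std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17.
std::string utf16_to_utf8( std::u16string const& utf16 )
{
    std::wstring_convert< std::codecvt_utf8_utf16<char16_t>, char16_t > cvt;
    return cvt.to_bytes( utf16 );
}

// Hypothetical inverse helper: converts a UTF-8 string to UTF-16.
std::u16string utf8_to_utf16( std::string const& utf8 )
{
    std::wstring_convert< std::codecvt_utf8_utf16<char16_t>, char16_t > cvt;
    return cvt.from_bytes( utf8 );
}
```

With these helpers, utf16_to_utf8( u"\U0001F600" ) takes two UTF-16 code units (one surrogate pair) and should yield the four UTF-8 bytes f0 9f 98 80, while utf8_to_utf16 maps those four bytes back to two code units. On Windows the same facet instantiated for wchar_t, as in the answer's example, is the one to pair with path::wstring().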