
Getting a boost::filesystem::path as a UTF-8 encoded std::string, on Windows

We are representing paths as boost::filesystem::path, but in some cases other APIs are expecting them as const char * (e.g., to open a DB file with SQLite).

From the documentation, path::value_type is wchar_t on Windows. As far as I know, a Windows wchar_t is 2 bytes wide and holds a UTF-16 code unit.

There is a string() native observer that returns a std::string, while stating:

If string_type is a different type than String, conversion is performed by cvt.

cvt is initialized to a default constructed codecvt. What is the behaviour of this default constructed codecvt?

There is this forum entry that recommends using an instance of utf8_codecvt_facet as the cvt value to portably convert to UTF-8. But it seems that this codecvt actually converts between UTF-8 and UCS-4, not UTF-16.

What would be the best way (portable, if possible) to obtain a std::string representation of a path, making sure to convert from the right wchar_t encoding when necessary?

asked Aug 08 '17 15:08 by Ad N


1 Answer

cvt is initialized to a default constructed codecvt. What is the behaviour of this default constructed codecvt?

It uses the default locale to convert to the locale-specific multi-byte character set. On Windows, this locale normally corresponds to the regional settings in the Control Panel, so the result depends on the machine's configuration and is not UTF-8 in general.

What would be the best way (and if possible portable) to obtain an std::string representation of a path, making sure to convert from the right wchar_t encoding when necessary?

The C++11 standard introduced std::codecvt_utf8_utf16. Although it is deprecated as of C++17, according to this paper it will be available "until a suitable replacement is standardized".

To use this facet, call the static function:

boost::filesystem::path::imbue( 
    std::locale( std::locale(), new std::codecvt_utf8_utf16<wchar_t>() ) );

After that all calls to path::string() will convert from UTF-16 to UTF-8.

Another way is to use std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > to do the conversion only in some cases.
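
A minimal sketch of that case-by-case approach, wrapped in a reusable function. The names to_utf8 and check_to_utf8 are hypothetical helpers introduced here for illustration, not part of Boost or the standard library. The sketch assumes each wchar_t holds one UTF-16 code unit, as on Windows, and it takes a std::wstring so that it compiles without Boost; in real code the argument would be path.wstring().

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (not part of Boost): converts a UTF-16 wide string,
// such as the result of path::wstring() on Windows, to a UTF-8 std::string.
// Assumption: each wchar_t element holds one UTF-16 code unit.
inline std::string to_utf8( std::wstring const& wide )
{
    // Deprecated since C++17, but still provided by current standard libraries.
    std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > cvt;

    // to_bytes() throws std::range_error if the input cannot be converted.
    return cvt.to_bytes( wide );
}

// Self-check with the same characters as the complete example ("ÄÖÜ").
inline void check_to_utf8()
{
    assert( to_utf8( L"\u00c4\u00d6\u00dc" ) == "\xc3\x84\xc3\x96\xc3\x9c" );

    // Pure ASCII input passes through unchanged.
    assert( to_utf8( L"db.sqlite" ) == "db.sqlite" );
}
```

For an API that expects a UTF-8 const char *, such as sqlite3_open, the result could then be passed as to_utf8( path.wstring() ).c_str(). Keeping the conversion local like this avoids changing the process-wide behaviour of path::string(), which other code may rely on.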

Complete example code:

#include <boost/filesystem.hpp>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

void print_hex( std::string const& path );

int main()
{
    // Create UTF-16 path (on Windows) that contains the characters "ÄÖÜ".
    boost::filesystem::path path( L"\u00c4\u00d6\u00dc" );

    // Convert the path using the default locale and print the result.
    // On a system with a German default locale, this prints "0xc4 0xd6 0xdc".
    // On a system with a different locale, this might fail.
    print_hex( path.string() );

    // Set locale for conversion from UTF-16 to UTF-8.
    boost::filesystem::path::imbue( 
        std::locale( std::locale(), new std::codecvt_utf8_utf16<wchar_t>() ) );

    // Because we changed the locale, path::string() now converts the path to UTF-8.
    // This always prints the UTF-8 bytes "0xc3 0x84 0xc3 0x96 0xc3 0x9c".
    print_hex( path.string() );

    // Another option is to convert only case-by-case, by explicitly using a code converter.
    // This always prints the UTF-8 bytes "0xc3 0x84 0xc3 0x96 0xc3 0x9c".
    std::wstring_convert< std::codecvt_utf8_utf16<wchar_t> > cvt;
    print_hex( cvt.to_bytes( path.wstring() ) );
}

void print_hex( std::string const& path )
{
    for( char c : path )
    {
        std::cout << std::hex << "0x" << static_cast<unsigned>(static_cast<unsigned char>( c )) << ' ';
    }
    std::cout << '\n';
}
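
The question raised the concern that a UTF-8/UCS-4 facet is not the same as a UTF-8/UTF-16 one. The difference shows up for characters outside the Basic Multilingual Plane, which UTF-16 stores as a surrogate pair. The following is a small sketch of that case, not a definitive implementation. It uses char16_t instead of wchar_t so that the element width does not depend on the platform, and the names utf16_to_utf8 and check_surrogate_pair are hypothetical.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: converts UTF-16 code units to UTF-8.
// char16_t is used so the code unit width is 16 bits on every platform.
inline std::string utf16_to_utf8( std::u16string const& utf16 )
{
    std::wstring_convert< std::codecvt_utf8_utf16<char16_t>, char16_t > cvt;
    return cvt.to_bytes( utf16 );
}

inline void check_surrogate_pair()
{
    // U+1F600 lies outside the BMP, so UTF-16 needs two code units
    // (the surrogate pair D83D DE00) to store it.
    std::u16string const emoji = u"\U0001F600";
    assert( emoji.size() == 2 );

    // The UTF-16 aware facet combines the pair into one code point,
    // which UTF-8 encodes in four bytes.
    assert( utf16_to_utf8( emoji ) == "\xf0\x9f\x98\x80" );
}
```

A facet that treats every 16-bit unit as a complete code point cannot produce this four-byte sequence, which is why the UTF-16 aware facet is the one to use when wchar_t is 16 bits wide.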
answered Oct 16 '22 18:10 by zett42