Convert between std::u8string and std::string

UTF-8 "support" in C++20 seems to be a bad joke.

The only UTF functionality in the Standard Library is support for strings and string_views (std::u8string, std::u8string_view, std::u16string, ...). That is all. There is no Standard Library support for UTF coding in regular expressions, formatting, file i/o and so on.

In C++17 you can--at least--easily treat any UTF-8 data as 'char' data, which makes usage of std::regex, std::fstream, std::cout, etc. possible without loss of performance.

In C++20 things change. You can no longer write, for example, std::string text = u8"..."; and it is impossible to write something like

std::u8fstream file; std::u8string line; ... file << line;

since there is no std::u8fstream.

Even the new C++20 std::format does not support UTF at all, because all necessary overloads are simply missing. You cannot write

std::u8string text = std::format(u8"...{}...", 42);

To make matters worse, there is no simple cast (or conversion) between std::string and std::u8string (or even between const char* and const char8_t*). So if you want to format (using std::format) or input/output (std::cin, std::cout, std::fstream, ...) UTF-8 data, you have to internally copy all strings. That is an unnecessary performance killer.

Finally, what use will UTF have without input, output, and formatting?

At present, std::c8rtomb and std::mbrtoc8 are the only interfaces provided by the standard that enable conversion between the execution encoding and UTF-8. The interfaces are awkward. They were designed to match pre-existing interfaces like std::c16rtomb and std::mbrtoc16. The wording added to the C++ standard for these new interfaces intentionally matches the wording in the C standard for the pre-existing related functions (hopefully these new functions will eventually be added to C; I still need to pursue that). The intent in matching the C standard wording, as confusing as it is, is to ensure that anyone familiar with the C wording recognizes that the char8_t interfaces work the same way.

cppreference.com has some examples for the UTF-16 versions of these functions that should be useful for understanding the char8_t variants.

https://en.cppreference.com/w/cpp/string/multibyte/mbrtoc16
https://en.cppreference.com/w/cpp/string/multibyte/c16rtomb

The common answer given by C++ authorities at the yearly CppCon convention (for example in 2018 and 2019) was that you should pick your own UTF-8 library. There are all kinds of flavours; just pick the one you like. There is still embarrassingly little understanding of and support for Unicode on the C++ side.

Some people hope there will be something in C++23 but we don't even have an official working group so far.

Update 2021 MAR 19

Few things have (not) happened. __STDC_UTF_8__ is no more and <cuchar> is still not implemented by any of "the Three".

Probably much better code matching this thread is HERE.

Update 2020 MAR 17

std::c8rtomb and std::mbrtoc8 are not yet provided.

2019 NOV

The upcoming C++20-ready compilers made by "the Three" do not yet provide std::c8rtomb and std::mbrtoc8, the functions that enable conversion between the execution encoding and UTF-8. Both are described in the C++20 standard.

It might be subjective, but c8rtomb() is not an "awkward" interface, to me.

WANDBOX

//  g++ prog.cc -std=gnu++2a
//  clang++ prog.cc -std=c++2a
#include <stdio.h>
#include <clocale>
#ifndef __clang__
#include <cuchar>
#else
// clang has no <cuchar>
#include <uchar.h>
#endif
#include <climits>

template<size_t N>
void  u32sample( const char32_t (&str32)[N] )
{
    #ifndef __clang__
    std::mbstate_t state{};
    #else
    mbstate_t state{};
    #endif
    
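    char out[MB_LEN_MAX]{};
    for(char32_t const & c : str32)
    {
    #ifndef __clang__
        size_t rc = std::c32rtomb(out, c, &state);
    #else
        size_t rc = ::c32rtomb(out, c, &state);
    #endif
        // rc is the number of bytes written, or (size_t)-1 on error;
        // out is not null-terminated, so print exactly rc bytes
        if (rc != (size_t)-1)
            printf("%.*s", (int)rc, out ) ;
    }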
}

#ifdef __STDC_UTF_8__
template<size_t N>
void  u8sample( const char8_t (& str8)[N])
{
    std::mbstate_t state{};
    
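    char out[MB_LEN_MAX]{};
    for(char8_t const & c : str8)
    {
        // c8rtomb returns 0 until a whole code point has been consumed,
        // so print only the rc bytes it reports instead of stale data
        size_t rc = std::c8rtomb(out, c, &state);
        if (rc != (size_t)-1)
            printf("%.*s", (int)rc, out ) ;
    }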
}
#endif // __STDC_UTF_8__
int main () {
    std::setlocale(LC_ALL, "en_US.utf8");

    #ifdef __linux__
    printf("\nLinux like OS, ") ;
    #endif

    printf(" Compiler %s\n", __VERSION__   ) ;
    
   printf("\nchar32_t *, Converting to 'char *', and then printing --> " ) ;
   u32sample( U"ひらがな" ) ;
    
  #ifdef __STDC_UTF_8__
   printf("\nchar8_t *, Converting to 'char *', and then printing --> " ) ;
   u8sample( u8"ひらがな" ) ;
  #else
   printf("\n\n__STDC_UTF_8__ is not defined, can not use char8_t");
  #endif
   
   printf("\n\nDone ..." ) ;
    
    return 42;
}

Lines that do not compile as of today are guarded by preprocessor checks and documented in comments.

VS 2019

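  // Print a u8string by handing its bytes to the char-based stream.
  // data() is null-terminated, but output stops at the first embedded '\0'.
  std::ostream& operator<<(std::ostream& os, const std::u8string& str)
    {
        os << reinterpret_cast<const char*>(str.data());
        return os;
    }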

To set console to UTF-8 use https://github.com/MicrosoftDocs/cpp-docs/issues/1915#issuecomment-589644386


Tags:

c++

unicode

utf-8

c++20
