I am using std::wstring_convert to convert a wstring into a multibyte string as follows:
    // convert from wide char to multibyte char
    try
    {
        return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(wideMessage);
    }
    // thrown by std::wstring_convert.to_bytes() for bad conversions
    catch (std::range_error& exception)
    {
        // do something...
    }
In order to unit test the block I have commented as "do something...", I wish to pass a wstring that will make the conversion throw a std::range_error exception.
However, I have not been able to construct a wstring that fails this conversion. The wstring will use UTF-16, and I have been reading about high and low surrogates. For example, the UTF-16 code unit D800 followed by "b" should be invalid. std::wstring(L"\xd800b") fails to compile, possibly on the same grounds. If I build the wstring as below, the conversion does not throw:
std::wstring wideMessage(L" b");
wideMessage[0] = L'\xd800';
// doesn't throw
std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(wideMessage);
Is there a suitable wstring I can use to throw an exception during the conversion?
I have tried 5.1, 5.2 and 5.3 from this link. I am using Visual Studio 2015.
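A side note on the literal that fails to compile: a hex escape has no length limit, so in L"\xd800b" the trailing b is read as one more hex digit. That gives the single value 0xD800B, which does not fit in the 16-bit wchar_t used by Visual Studio. A minimal sketch of one way around this, relying on the fact that adjacent string literals are concatenated only after escapes are processed (the helper name make_lone_surrogate_string is made up for illustration):

```cpp
#include <string>

// Hypothetical helper: builds the two-character string 0xD800, 'b'.
// Splitting the literal ends the hex escape at the first closing quote,
// so 'b' stays a separate character instead of extending the escape.
std::wstring make_lone_surrogate_string()
{
    return std::wstring(L"\xd800" L"b");
}
```

This only addresses how to write the test input; whether the conversion then throws still depends on the facet and the standard library in use.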
Microsoft's implementation of std::codecvt_utf8 appears to successfully convert any UTF-16 code unit into UTF-8, including unpaired surrogates. This is a bug, as surrogate code points are not encodable in UTF-8. Both libc++ (LLVM) and libstdc++ (GCC) will correctly throw a std::range_error and fail to convert unpaired surrogates.
Looking at their code, it appears that the only way for it to throw is if a character is greater than the Maxcode template parameter of the facet. For example:
std::wstring_convert<std::codecvt_utf8<wchar_t, 0x1>>
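To make the Maxcode idea concrete, here is a minimal sketch that forces a failure by capping the facet at 0x7F. The helper name throws_with_ascii_maxcode and the choice of 0x7F are assumptions for illustration, not part of the original code. Note also that std::wstring_convert and the <codecvt> facets are deprecated since C++17, so a compiler may warn.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether converting `wide` with a facet whose
// Maxcode is 0x7F raises std::range_error.
bool throws_with_ascii_maxcode(const std::wstring& wide)
{
    try
    {
        // Any character above 0x7F exceeds Maxcode and should be rejected.
        std::wstring_convert<std::codecvt_utf8<wchar_t, 0x7F>>().to_bytes(wide);
        return false;
    }
    catch (const std::range_error&)
    {
        return true;
    }
}
```

Calling throws_with_ascii_maxcode(L"\x00e9") should exercise the error path, while a plain ASCII string such as L"abc" should convert normally. The drawback is that the test then uses a different facet from the production code, so it only works if the facet type can be injected.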
As pointed out by 一二三 (accepted answer), Microsoft's implementation of codecvt_utf8 appears to be buggy.
I know the strings I am dealing with are always UTF-16, and I want to convert them to UTF-8. I ended up changing the implementation as follows:
    // convert from wide char to multibyte char
    try
    {
        return std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().to_bytes(wideMessage);
    }
    // thrown by std::wstring_convert.to_bytes() for bad conversions
    catch (const std::range_error & exception)
    {
        // do something...
    }
The following will now throw correctly:
std::wstring wideMessage(L" b");
wideMessage[0] = L'\xd800';
// throws std::range_error
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().to_bytes(wideMessage);
I would have never found this bug without unit testing!
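For anyone wanting to turn this into a repeatable test, the following is a rough sketch of the helpers such a test might use. All function names here (to_utf8, conversion_throws, unpaired_high_surrogate, valid_surrogate_pair) are hypothetical, and the expected behaviour is the one reported above for codecvt_utf8_utf16, so treat it as a starting point rather than a guarantee for every standard library.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around the UTF-16 to UTF-8 conversion shown above.
std::string to_utf8(const std::wstring& wide)
{
    return std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().to_bytes(wide);
}

// Returns true if converting `wide` raises std::range_error.
bool conversion_throws(const std::wstring& wide)
{
    try
    {
        to_utf8(wide);
        return false;
    }
    catch (const std::range_error&)
    {
        return true;
    }
}

// Invalid input: a high surrogate (0xD800) followed by an ordinary character.
std::wstring unpaired_high_surrogate()
{
    std::wstring s(L" b");
    s[0] = static_cast<wchar_t>(0xD800);
    return s;
}

// Valid input: the pair 0xD83D, 0xDE00, which encodes U+1F600.
std::wstring valid_surrogate_pair()
{
    std::wstring s(2, L' ');
    s[0] = static_cast<wchar_t>(0xD83D);
    s[1] = static_cast<wchar_t>(0xDE00);
    return s;
}
```

A test can then check that conversion_throws(unpaired_high_surrogate()) is true and that conversion_throws(valid_surrogate_pair()) is false. Including the valid pair guards against a converter that rejects every surrogate, which would make the error-path test pass for the wrong reason.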