 

How can I get wstring_convert::to_bytes to throw a range_error exception?

I am using std::wstring_convert to convert a wstring into a multibyte string as follows:

    // convert from wide char to multibyte char
    try
    {
        return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(wideMessage);
    }

    // thrown by std::wstring_convert.to_bytes() for bad conversions
    catch (std::range_error& exception)
    {
        // do something...
    }

To unit test the block I have commented as "do something...", I want to pass a wstring that makes the conversion throw a std::range_error exception.

However, I have not been able to construct a wstring that fails this conversion. The wstring uses UTF-16, and I have been reading about high and low surrogates. For example, the UTF-16 code unit D800 followed by "b" should be invalid. std::wstring(L"\xd800b") fails to compile, possibly on the same grounds. If I create the wstring as below instead, the conversion does not throw:

std::wstring wideMessage(L" b");
wideMessage[0] = L'\xd800';

// doesn't throw
std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(wideMessage);

Is there a suitable wstring I can use to throw an exception during the conversion?

I have tried 5.1, 5.2 and 5.3 from this link. I am using Visual Studio 2015.

Class Skeleton asked Aug 24 '15 12:08


2 Answers

Microsoft's implementation of std::codecvt_utf8 appears to successfully convert any UTF-16 code unit into UTF-8, including individual surrogates. This is a bug, as surrogates are not encodable in UTF-8. Both libc++ (LLVM) and libstdc++ (GCC) correctly refuse to convert unpaired surrogates and throw a std::range_error.

Looking at Microsoft's code, it appears that the only way to make it throw is to pass a character greater than the Maxcode template parameter of the facet. For example:

std::wstring_convert<std::codecvt_utf8<wchar_t, 0x1>>
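
As a rough sketch of how the Maxcode limit could be used to provoke the exception in a test: the helper name `throws_with_low_maxcode` and the limit `0x7f` are illustrative choices of mine, not part of the question, and note that `<codecvt>` and `std::wstring_convert` are deprecated since C++17, so a compiler may warn.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether converting `wide` throws
// std::range_error when the facet only accepts code points up to 0x7f.
// The 0x7f limit is an assumption for illustration; any character above
// the chosen Maxcode should be rejected by the facet.
bool throws_with_low_maxcode(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t, 0x7f>> converter;
    try
    {
        converter.to_bytes(wide);
        return false; // conversion succeeded
    }
    catch (const std::range_error&)
    {
        return true; // facet reported an error, wstring_convert threw
    }
}
```

Passing a string such as `std::wstring(1, wchar_t(0xE9))` should take the throwing path, while a plain ASCII string should not. This changes the facet under test, so it only helps if the test is allowed to substitute the converter type.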
一二三 answered Nov 11 '22 19:11


As pointed out by 一二三 (accepted answer), Microsoft's implementation of codecvt_utf8 appears to be buggy.

I know the strings I am dealing with are always UTF-16, and I want to convert them to UTF-8. I ended up changing the implementation as follows:

    // convert from wide char to multibyte char
    try
    {
        return std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().to_bytes(wideMessage);
    }

    // thrown by std::wstring_convert.to_bytes() for bad conversions
    catch (const std::range_error & exception)
    {
        // do something...
    }

The following will now throw correctly:

std::wstring wideMessage(L" b");
wideMessage[0] = L'\xd800';

// throws std::range_error
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().to_bytes(wideMessage);
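
For the unit test itself, something along these lines might work. It is only a sketch: `to_utf8_or` and `make_unpaired_surrogate` are hypothetical names, and returning a fallback string stands in for whatever the real "do something..." branch does.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around the conversion: returns the UTF-8 bytes,
// or `fallback` when the input is not valid UTF-16.
std::string to_utf8_or(const std::wstring& wide, const std::string& fallback)
{
    try
    {
        return std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().to_bytes(wide);
    }
    catch (const std::range_error&)
    {
        return fallback; // placeholder for the real error handling
    }
}

// Builds the invalid input from above: a high surrogate (0xD800)
// followed by 'b' instead of a low surrogate.
std::wstring make_unpaired_surrogate()
{
    std::wstring wide(L" b");
    wide[0] = static_cast<wchar_t>(0xD800);
    return wide;
}
```

A test can then check that `to_utf8_or(make_unpaired_surrogate(), "<bad>")` returns `"<bad>"`, and that a valid string such as `L"ab"` comes back as `"ab"`.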

I would have never found this bug without unit testing!

Class Skeleton answered Nov 11 '22 18:11