 

Is this code safe using wstring with MultiByteToWideChar?

Tags: c++, winapi

Is it safe to use std::wstring the way I am with MultiByteToWideChar?

std::wstring widen(const std::string &in)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, &in[0], -1, NULL, 0);
    std::wstring out(len, 0);
    MultiByteToWideChar(CP_UTF8, 0, &in[0], -1, &out[0], len);
    return out;
}
Josh asked Jun 28 '26 10:06


2 Answers

If you're asking whether it will work: probably. Whether it is correct is another matter:

  1. You should use in.c_str() instead of &in[0].
  2. You should check the return value of MultiByteToWideChar (0 signals failure), at least for the first call.
  3. MultiByteToWideChar invoked with a length of -1 will, if successful, count the zero terminator in its result (i.e. it will always return >= 1 on success). The length constructor of std::wstring does not need the terminator counted: std::wstring(5, 0) holds five wide chars and allocates space for six (5 + zero terminator). So technically you're allocating one wide char too many.

From MultiByteToWideChar docs on cbMultiByte and -1:

If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting Unicode string has a terminating null character, and the length returned by the function includes this character.
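The effect of counting the terminator can be reproduced without the Windows API. The sketch below is only an illustration: with_counted_terminator and drop_counted_terminator are hypothetical helpers (not part of the question's code) that mimic sizing a std::wstring with a length that already includes the terminator, then trimming it.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// Hypothetical stand-in for the original widen(): the length handed to the
// std::wstring constructor already counts the terminator, as it does when
// MultiByteToWideChar is called with -1.
std::wstring with_counted_terminator(const wchar_t* text)
{
    assert(text != nullptr);
    std::size_t len = std::wcslen(text) + 1;  // +1 for the terminator
    std::wstring out(len, L'\0');
    std::wmemcpy(&out[0], text, len);         // copies the terminator as well
    return out;
}

// Hypothetical fix-up: drop one trailing NUL if the string ends with one.
void drop_counted_terminator(std::wstring& s)
{
    if (!s.empty() && s.back() == L'\0')
        s.pop_back();
}
```

For the input L"hello", the first helper returns a string whose size() is 6 even though std::wcslen on its c_str() reports 5. After drop_counted_terminator, the string compares equal to L"hello".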

WhozCraig answered Jun 29 '26 22:06


There is a problem with your first call to MultiByteToWideChar: the character sequence that &in[0] points to is not guaranteed to be zero-terminated (although in practice it usually is). Change that line to:

int len = MultiByteToWideChar(CP_UTF8, 0, in.c_str(), -1, NULL, 0);

and you should be safe. Even if MultiByteToWideChar fails and returns 0, this is accounted for by passing len as the final parameter in the second call to MultiByteToWideChar.

With that said, it is safe in the sense that it doesn't crash or corrupt memory. There is, however, one more issue: unless the input string causes MultiByteToWideChar to fail, the returned string will report a size() that is one character larger than it should be. I would recommend changing the code as follows:

#include <windows.h>
#include <stdexcept>
#include <string>

std::wstring widen(std::string const &in)
{
    std::wstring out{};

    if (in.length() > 0)
    {
        // Calculate target buffer size (not including the zero terminator).
        // The length parameters are ints, so size_t values are cast explicitly.
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      in.c_str(), static_cast<int>(in.size()),
                                      NULL, 0);
        if ( len == 0 )
        {
            throw std::runtime_error("Invalid character sequence.");
        }

        out.resize(len);
        // No error checking. We already know that the conversion will succeed.
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            in.c_str(), static_cast<int>(in.size()),
                            &out[0], static_cast<int>(out.size()));
                            // Use out.data() in place of &out[0] for C++17
    }

    return out;
}
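One detail the function above leaves open: in.size() is a std::size_t, while the length parameters of MultiByteToWideChar are ints. A minimal sketch of a range check follows, assuming you want an exception rather than silent truncation. checked_length is a hypothetical helper name, not something from the answer.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a string length to int, throwing if the
// value does not fit into the int parameter the API expects.
int checked_length(const std::string& in)
{
    if (in.size() > static_cast<std::size_t>(INT_MAX))
    {
        throw std::length_error("Input too long for a single conversion call.");
    }
    return static_cast<int>(in.size());
}
```

The result could be passed as the cbMultiByte argument in both calls. Inputs larger than INT_MAX bytes are rare, so whether the extra check is worth it is a judgment call.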

This implementation addresses the following issues:

  • It reports errors in case the input sequence is not valid UTF-8, by passing the MB_ERR_INVALID_CHARS flag.
  • Errors are reported by throwing exceptions. That makes it possible to distinguish between a conversion error and a successful call that returns a zero-sized string. (Note: The std::wstring constructor already throws exceptions in case of failure. It would feel unnatural not to throw exceptions for other errors.)
  • The implementation properly deals with input containing embedded NUL characters. This is rarely used, but when it is (e.g. when composing the OPENFILENAME's lpstrFilter member), it won't (silently) fail for that reason.
  • It doesn't over-allocate the return value's container storage. In case the cbMultiByte argument is set to -1 in a call to MultiByteToWideChar, the returned length does include space for the zero terminator. This character, however, is owned by the std::string implementation, and not part of the character sequence to be converted.
  • Related to the previous bullet point, this implementation doesn't convert the zero terminator. The original code did, so the buffer returned by c_str() ended in 2 NUL characters: the converted one, plus the one std::wstring appends itself.
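The point about embedded NUL characters can be illustrated without calling the Windows API. The sketch below uses two hypothetical helpers: make_filter builds a string in the NUL-separated layout that lpstrFilter uses, and bytes_seen_with_minus_one computes how many bytes a C-style scan (which is what passing -1 implies) would process.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical example data: a filter in the layout OPENFILENAME's
// lpstrFilter uses, where NUL characters separate the entries.
std::string make_filter()
{
    static const char raw[] = "Text files\0*.txt\0";
    // sizeof counts the literal's implicit terminator; leave that one out.
    return std::string(raw, sizeof(raw) - 1);
}

// Bytes a C-style scan would process, including the first terminator
// it meets. Everything after that terminator is ignored.
std::size_t bytes_seen_with_minus_one(const std::string& in)
{
    return std::strlen(in.c_str()) + 1;
}
```

Here make_filter().size() is 17, while a scan that stops at the first NUL sees only 11 bytes, so the "*.txt" part would be lost. Passing in.size() explicitly, as the implementation above does, keeps the whole sequence.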
IInspectable answered Jun 30 '26 00:06