Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

wchar_t* to char* conversion problems

Tags:

c++

char

wchar

I have a problem with wchar_t* to char* conversion.

I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.

Assume that wchar string is "New Text File.txt" In Visual Studio debugger when hovering on variable in shows "N" and some unknown Chinese letters. Though in watches string is represented correctly.

When I try to convert wchar to char with wcstombs

wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);

it converts just two letters to char* ("Ne") and then generates an error.

Some internal error in wcstombs.c at function _wcstombs_l_helper() at this block:

if (*pwcs > 255)  /* validate high byte */
{
    errno = EILSEQ;
    return (size_t)-1;  /* error */
}

It's not thrown up as exception.

What can be the problem?

like image 665
Anatoly Lushnikov Avatar asked Dec 07 '22 17:12

Anatoly Lushnikov


1 Answers

In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.

Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:

The maximum number of bytes that can be stored in the multibyte output string.

Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:

If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.

Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.

Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:

A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.

The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:

The size of the file name portion of the record, in bytes.

Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).

Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:

If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.

So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().

Putting this all together, here is a snippet of code that does what you are trying to do:

size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into a the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!

Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.

ONE LAST THING

After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,

FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;

If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:

typedef struct _FILE_NOTIFY_INFORMATION {
  DWORD NextEntryOffset;
  DWORD Action;
  DWORD FileNameLength;
  WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;

When the compiler generates code for the assignment operation, it does't understand the trickery that is being pulled with FileName being a variable length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.

like image 86
chess007 Avatar answered Dec 25 '22 19:12

chess007