 

How can I audit my Windows application for correct Unicode handling?

I can't use prepackaged Unicode string libraries, such as ICU, because they blow up the size of the binary to an insane degree (it's a 200k program; ICU is 16MB+!).

I'm using the builtin wchar_t string type for everything already, but I want to ensure I'm not doing anything stupid in terms of doing iteration on strings, or things like that.

Are there tools that do for Unicode what fuzzers do for security? That is, tools that throw characters outside the Basic Multilingual Plane at my code and ensure everything gets handled correctly as UTF-16?

(Oh, and obviously a cross-platform solution works too, though most cross-platform tools would have to support both UTF-8 and UTF-16.)

EDIT: Also note things that are less obvious than UTF-16 surrogate pairs -- things like accent marks!
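No off-the-shelf Unicode fuzzer is named in the answers below, but the idea is easy to sketch. The following is a hedged, hypothetical example (all helper names are invented, not from any library): generate UTF-16 test strings mixing ASCII, combining diacritics, and supplementary-plane code points, then feed them to the code under test.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Hypothetical helper: append one code point to a UTF-16 string,
// splitting it into a surrogate pair when it lies outside the BMP.
void AppendUtf16(std::u16string& s, char32_t cp) {
    if (cp < 0x10000) {
        s.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        s.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // lead
        s.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // trail
    }
}

// Build a random test string mixing ASCII letters, combining marks
// (the "accent marks" case), and code points outside the BMP.
std::u16string MakeFuzzString(std::mt19937& rng, std::size_t count) {
    std::u16string s;
    std::uniform_int_distribution<int> kind(0, 2);
    std::uniform_int_distribution<std::uint32_t> ascii('a', 'z');
    std::uniform_int_distribution<std::uint32_t> combining(0x0300, 0x036F);
    std::uniform_int_distribution<std::uint32_t> astral(0x10000, 0x10FFFF);
    for (std::size_t i = 0; i < count; ++i) {
        switch (kind(rng)) {
            case 0:  AppendUtf16(s, ascii(rng));     break;
            case 1:  AppendUtf16(s, combining(rng)); break;
            default: AppendUtf16(s, astral(rng));    break;
        }
    }
    return s;
}
```

Pushing such strings through every iteration, splitting, and measuring routine in the program is a cheap way to smoke out the "less obvious" cases mentioned above.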

Billy ONeal asked Jun 20 '11 15:06

2 Answers

The wrong answer

Use WM_UNICHAR, it handles UTF-32 and can handle Unicode Supplementary Plane characters.

While this is almost true, the complete truth looks like this:

  1. WM_UNICHAR is a hack designed for ANSI Windows to receive Unicode characters. Create a Unicode window and you will never receive it.
  2. Create an ANSI window and you will be surprised that it still doesn't work as expected. The catch is that when the window is created, you receive a WM_UNICHAR with 0xffff to which you must react by returning 1 (the default window procedure will return 0). Fail to do this, and you will never see a WM_UNICHAR again. Good job that the official documentation doesn't tell you that.
  3. Run your program on a system that, for mysterious reasons, doesn't support WM_UNICHAR (such as my Windows 7 64 system) and it still won't work, even if you do everything correctly.
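Point 2 above can be sketched as follows. This is a minimal, self-contained fragment: the two constants are reproduced here with their Win32 values so it compiles anywhere, but a real program would get them from `<windows.h>` and put this logic inside its window procedure.

```cpp
#include <cstdint>

// Win32 values, written out so the sketch stands alone;
// a real program gets these from <windows.h>.
constexpr std::uint32_t kWmUnichar    = 0x0109;  // WM_UNICHAR
constexpr std::uint32_t kUnicodeNoChar = 0xFFFF; // UNICODE_NOCHAR

// Sketch of the WM_UNICHAR branch of a window procedure. Returns the
// value the window procedure should return; writes the received code
// point (if any) through `out`.
std::intptr_t OnUniChar(std::uintptr_t wParam, char32_t* out, bool* got_char) {
    *got_char = false;
    if (wParam == kUnicodeNoChar) {
        // The probe sent at window creation: return 1 (TRUE) to keep
        // receiving WM_UNICHAR; the default window procedure returns 0.
        return 1;
    }
    *out = static_cast<char32_t>(wParam);  // a full UTF-32 code point
    *got_char = true;
    return 0;
}
```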

The theoretically correct answer

There is nothing to audit or to pay attention to.

Compile with UNICODE defined, or explicitly create your window class as well as your window using a "W" function, and use WM_CHAR as if this was the most natural thing to do. That's it. It is indeed the most natural thing.

WM_CHAR uses UTF-16 (except when it doesn't, such as under Windows 2000). Of course, a single UTF-16 character cannot represent code points outside the BMP, but that is not a problem because you simply get two WM_CHAR messages containing a surrogate pair. It's entirely transparent to your application, you do not need to do anything special. Any Windows API function that accepts a wide character string will happily accept these surrogates, too.
The only thing to be aware of is that now the character length of a string (obviously) is no longer simply the number of 16-bit words. But that was a wrong assumption to begin with, anyway.
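As a sketch of what "two WM_CHAR messages containing a surrogate pair" means in code, here is the assembly step and a code-point count, written as portable C++ with no Win32 dependency (the function names are mine, not from any API):

```cpp
#include <cstddef>
#include <string>

// Combine a lead/trail surrogate pair (as delivered by two consecutive
// WM_CHAR messages) into a single code point.
char32_t CombineSurrogates(char16_t lead, char16_t trail) {
    return 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10)
                   + (static_cast<char32_t>(trail) - 0xDC00);
}

// Count code points, not 16-bit units: a surrogate pair is one character.
std::size_t CodePointCount(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&        // lead surrogate
            i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)  // trail surrogate
            ++i;                                        // consume the pair
    }
    return n;
}
```

Note that even the code-point count is not the "character count" a user sees: a base letter plus a combining accent is two code points but one visible character, which is exactly the less obvious case the question's EDIT warns about.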

The sad truth

In reality, on many (most? all?) systems, you just get a single WM_CHAR message with wParam containing the low 16 bits of the key code. Which is mighty fine for anything within the BMP, but sucks otherwise.

I have verified this both by using Alt-keypad codes and creating a custom keyboard layout which generates code points outside the BMP. In either case, only a single WM_CHAR is received, containing the lower 16 bits of the character. The upper 16 bits are simply thrown away.

In order for your program to work 100% correctly with Unicode, you must apparently use the Input Method Manager (ImmGetCompositionStringW), which is a nuisance and badly documented. For me, personally, this simply means: "OK, screw that". But if you are interested in being 100% correct, look at the source code of any editor using Scintilla, which does just that and works perfectly.

Damon answered Nov 12 '22 08:11


Some things to check:

  • Make sure that instead of handling WM_CHAR you're handling WM_UNICHAR:

    The WM_UNICHAR message is the same as WM_CHAR, except it uses UTF-32. It is designed to send or post Unicode characters to ANSI windows, and it can handle Unicode Supplementary Plane characters.

  • Do not assume that the ith character is at index i. It obviously isn't, and if you happen to use that fact for, say, breaking a string in half, then you could be messing it up.

  • Don't tell the user (in a status bar or something) that the user has N characters just because the character array has length N.
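The "breaking a string in half" pitfall above can be sketched like this: before splitting a UTF-16 string at an arbitrary index, back up one unit if the index would fall between the two halves of a surrogate pair (a hypothetical helper, not part of any Windows API):

```cpp
#include <cstddef>
#include <string>

// Adjust a proposed split index so it never lands between the lead and
// trail halves of a surrogate pair.
std::size_t SafeSplitIndex(const std::u16string& s, std::size_t i) {
    if (i > s.size()) return s.size();
    if (i > 0 && i < s.size() &&
        s[i]     >= 0xDC00 && s[i]     <= 0xDFFF &&  // trail surrogate here
        s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF)    // lead surrogate before
        --i;  // move the split back before the pair
    return i;
}
```

Even this is not the whole story: splitting between a base letter and its combining accent is perfectly valid UTF-16 but still produces user-visible breakage, so a fully correct implementation would split on grapheme-cluster boundaries.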

user541686 answered Nov 12 '22 07:11