I can't use prepackaged Unicode string libraries, such as ICU, because they blow up the size of the binary to an insane degree (it's a 200k program; ICU is 16MB+!).
I'm already using the built-in wchar_t string type for everything, but I want to make sure I'm not doing anything stupid when iterating over strings, or the like.
Are there tools that do for Unicode what fuzzers do for security? That is, tools that throw characters outside the Basic Multilingual Plane at my code and verify they are handled correctly as UTF-16?
(Oh, and obviously a cross platform solution works, though most cross platform things would have to support both UTF-8 and UTF-16)
EDIT: Also note things that are less obvious than UTF-16 surrogate pairs -- things like accent marks!
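In the absence of a ready-made tool, a small hand-rolled set of nasty test strings already goes a long way. A minimal sketch (C++ with char16_t, which has the same width as wchar_t on Windows; the function name and the particular strings are just illustrative):

```cpp
#include <string>
#include <vector>

// A few hand-picked "nasty" UTF-16 test strings: non-BMP characters
// (which become surrogate pairs) and combining accent marks.
std::vector<std::u16string> trickyStrings() {
    return {
        u"\U0001F600",      // emoji: one code point, TWO UTF-16 units
        u"\U00010400abc",   // non-BMP letter followed by plain ASCII
        u"e\u0301",         // 'e' + combining acute: two code points, one visible character
        u"a\u0300\u0316b",  // multiple combining marks stacked on one base
    };
}
```

Feeding strings like these through every code path that measures, splits, or displays text tends to flush out exactly the surrogate-pair and accent-mark bugs mentioned above.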
Use WM_UNICHAR; it handles UTF-32 and can handle Unicode Supplementary Plane characters.
While this is almost true, the complete truth looks like this:

- WM_UNICHAR is a hack designed for ANSI windows to receive Unicode characters. Create a Unicode window and you will never receive it.
- The first WM_UNICHAR arrives with 0xffff (UNICODE_NOCHAR), to which you must react by returning 1 (the default window procedure will return 0). Fail to do this, and you will never see a WM_UNICHAR again. Good job that the official documentation doesn't tell you that.
- On some systems (such as my Windows 7 64 system) it still won't work, even if you do everything correctly.

There is nothing to audit or to pay attention to.
Compile with UNICODE defined, or explicitly create your window class as well as your window using a "W" function, and use WM_CHAR as if this were the most natural thing to do. That's it. It is indeed the most natural thing.
WM_CHAR uses UTF-16 (except when it doesn't, such as under Windows 2000). Of course, a single UTF-16 code unit cannot represent code points outside the BMP, but that is not a problem: you simply get two WM_CHAR messages containing a surrogate pair. It's entirely transparent to your application; you do not need to do anything special. Any Windows API function that accepts a wide character string will happily accept these surrogates, too.
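For code that consumes those two messages, recombining the pair into a code point is purely mechanical. A minimal sketch of the arithmetic (plain C++ with char16_t, which matches the width of wchar_t on Windows; the helper names are my own, not Win32 API):

```cpp
#include <cstdint>

// True if u is a high (lead) surrogate, i.e. the first half of a pair.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is a low (trail) surrogate, i.e. the second half of a pair.
constexpr bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a UTF-16 surrogate pair into the code point it encodes.
// hi must be a high surrogate and lo a low surrogate.
constexpr char32_t combineSurrogates(char16_t hi, char16_t lo) {
    return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
}
```

A window procedure would typically remember a pending high surrogate from one WM_CHAR and combine it when the matching low surrogate arrives in the next.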
The only thing to be aware of is that the character length of a string is now (obviously) no longer simply the number of 16-bit units. But that was a wrong assumption to begin with, anyway.
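Counting code points instead of 16-bit units is a one-liner, since only the low (trail) surrogate of each pair needs to be skipped. A sketch (assuming well-formed UTF-16; the function name is illustrative):

```cpp
#include <cstddef>
#include <string>

// Count the code points in a UTF-16 string: every unit counts except
// low (trail) surrogates, which are the second half of a pair.
std::size_t codePointCount(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s)
        if (u < 0xDC00 || u > 0xDFFF)  // skip trail surrogates
            ++n;
    return n;
}
```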
In reality, on many (most? all?) systems, you just get a single WM_CHAR message with wParam containing the low 16 bits of the key code. Which is mighty fine for anything within the BMP, but sucks otherwise.
I have verified this both by using Alt-keypad codes and by creating a custom keyboard layout which generates code points outside the BMP. In either case, only a single WM_CHAR is received, containing the lower 16 bits of the character. The upper 16 bits are simply thrown away.
In order for your program to work 100% correctly with Unicode, you must apparently use the Input Method Manager (ImmGetCompositionStringW), which is a nuisance and badly documented. For me, personally, this simply means: "OK, screw that." But if you are interested in being 100% correct, look at the source code of any editor using Scintilla, which does just that and works perfectly.
Some things to check:
Make sure that instead of handling WM_CHAR you're handling WM_UNICHAR: "The WM_UNICHAR message is the same as WM_CHAR, except it uses UTF-32. It is designed to send or post Unicode characters to ANSI windows, and it can handle Unicode Supplementary Plane characters."
Do not assume that the ith character is at index i. It obviously isn't, and if you happen to use that fact for, say, breaking a string in half, you could be messing it up.
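One way to break a string safely is to nudge the proposed split point off the middle of a surrogate pair first. A sketch (a hypothetical helper, not from the original post; assumes well-formed UTF-16):

```cpp
#include <cstddef>
#include <string>

// Adjust a proposed split index so it never lands between the two
// halves of a surrogate pair (which would corrupt both resulting strings).
std::size_t safeSplitIndex(const std::u16string& s, std::size_t i) {
    if (i > s.size()) return s.size();
    // If s[i] is a low (trail) surrogate, s[i-1] is its lead; back up one.
    if (i > 0 && i < s.size() && s[i] >= 0xDC00 && s[i] <= 0xDFFF)
        --i;
    return i;
}
```

Note that even this is not enough for user-perceived characters: splitting between a base letter and its combining accent mark still changes what the user sees, it just doesn't produce ill-formed UTF-16.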
Don't tell the user (in a status bar or something) that they have N characters just because the character array has length N.
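And if you do handle WM_UNICHAR, the UTF-32 value it delivers has to be re-encoded before you can append it to your UTF-16 (wchar_t on Windows) strings. A minimal sketch of that conversion (function name is illustrative; assumes a valid code point):

```cpp
#include <string>

// Encode one Unicode code point into UTF-16: a single unit inside
// the BMP, a surrogate pair outside it.
std::u16string encodeUtf16(char32_t cp) {
    if (cp < 0x10000)
        return std::u16string(1, char16_t(cp));
    cp -= 0x10000;
    char16_t hi = char16_t(0xD800 + (cp >> 10));   // high (lead) surrogate
    char16_t lo = char16_t(0xDC00 + (cp & 0x3FF)); // low (trail) surrogate
    return {hi, lo};
}
```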