
What are non-Unicode applications?

As we know, in Windows we can set the locale language for non-Unicode programs under "Control Panel\Clock, Language, and Region". But what does a locale language mean for an application? As I understand it, an application is a compiled binary executable that contains only machine code instructions and no data, so how does the character encoding affect how it runs?

One guess is that if the executable contains literal strings in its code segment, it uses some internal charset to encode them, and if that charset is not Unicode, the text is displayed as garbage. But isn't the internal charset a fixed one? In Java, for example, the spec defines the internal encoding as UTF-16.

Hope someone can answer my questions,

Thanks.

Alfred asked Oct 07 '10 08:10


2 Answers

Windows has two methods by which programs can talk to it, called the "ANSI API" and the "Unicode API", and a "non-Unicode application" is one that talks to Windows via the "ANSI API" rather than the "Unicode API".

What that means is that any string that the application passes to Windows is just a sequence of bytes, not a sequence of Unicode characters. Windows has to decide which characters that sequence of bytes corresponds with, and the Control Panel setting you're talking about is how it does that.

So for example, a non-Unicode program that outputs a byte with value 0xE4 will display the character ä on a PC set to use Windows Western (code page 1252), whereas the same program on a PC set up for Hebrew (code page 1255) will display the character ה.
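The same effect can be reproduced outside Windows. The following is only a sketch in Python, assuming that Python's `cp1252` and `cp1255` codecs stand in for the Western and Hebrew settings; it does not call any Windows API.

```python
import unicodedata

# A single byte has no inherent character; the code page decides.
raw = bytes([0xE4])

# Assumption: Python's "cp1252" codec models the Windows Western code page
# and "cp1255" models the Windows Hebrew code page.
western = raw.decode("cp1252")
hebrew = raw.decode("cp1255")

# Print code points and names rather than the glyphs themselves, so the
# output does not depend on the console's own encoding.
for label, ch in (("cp1252", western), ("cp1255", hebrew)):
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte never changes; only the table used to interpret it does.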

RichieHindle answered Sep 26 '22 17:09


RichieHindle correctly explains that most APIs come in two variants, a *W (Unicode) one and an *A (ANSI) one. But after that he's slightly wrong.

It's important to know that the *A variants (such as MessageBoxA) are just wrappers for the *W versions (such as MessageBoxW). They take the input strings and convert them to Unicode; they take the output strings and convert them back.

In the Windows SDK, for all such A/W pairs, there is an #ifdef UNICODE block such that MessageBox() is a macro that expands to either MessageBoxA() or MessageBoxW(). Because all these macros use the same condition, many programs use either 100% *A functions or 100% *W functions. "Non-Unicode" applications are then those that have not defined UNICODE, and therefore use the *A variants exclusively.

However, there is no reason why you can't mix and match *A and *W functions. Would programs that mix *A and *W functions be considered "Unicode", "non-Unicode" or even something else? Actually, the answer is also mixed. When it comes to that Clock, Language, and Region setting, an application is considered a Unicode application while it's making a *W call, and a non-Unicode application while it's making an *A call: the setting controls how the *A wrappers translate to *W calls. And in multi-threaded programs, you can therefore be both at the same time (!).

So, to come back to RichieHindle's example, if you call a *A function with value (char)0xE4, the wrapper will forward to the *W function with either L'ä' or L'ה' depending on this setting. If you then call the *W function directly with the value (WCHAR)0x00E4, no translation happens.
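To make the wrapper relationship concrete, here is a rough simulation in Python. The names `message_box_w`, `message_box_a` and the `ansi_code_page` parameter are hypothetical stand-ins, not the real Windows functions; on Windows the code page comes from the system setting rather than from an argument.

```python
def message_box_w(text: str) -> str:
    """Hypothetical stand-in for a *W function: takes Unicode text as-is."""
    return text


def message_box_a(data: bytes, ansi_code_page: str) -> str:
    """Hypothetical stand-in for an *A wrapper.

    Decodes the bytes using the given code page (modelling the Control
    Panel setting), then forwards the resulting text to the *W function.
    """
    return message_box_w(data.decode(ansi_code_page))


# Same byte through the *A path, under two different settings.
via_a_western = message_box_a(b"\xe4", "cp1252")
via_a_hebrew = message_box_a(b"\xe4", "cp1255")

# Direct *W call: the value is already Unicode, so the setting is irrelevant.
direct_w = message_box_w("\u00e4")

print(f"A/cp1252 -> U+{ord(via_a_western):04X}")
print(f"A/cp1255 -> U+{ord(via_a_hebrew):04X}")
print(f"W direct -> U+{ord(direct_w):04X}")
```

Only the *A path consults the code page, which is why the setting is said to apply to "non-Unicode" calls rather than to whole programs.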

MSalters answered Sep 25 '22 17:09