In Visual Studio 2005 on 32-bit Windows, why doesn't my console display characters from 128 to 255?
for example:
cout << "¿" << endl; //inverted question mark
Output:
┐
Press any key to continue . . .
On Windows, extended ASCII characters can be typed with ALT codes: hold down the "ALT" key while you type the entire numeric code, and when you release the ALT key the glyph will appear. This only works with the number pad on desktop keyboards, and on some computers only if "Num Lock" is on.
Part of the genius of UTF-8 is that ASCII can be considered a 7-bit encoding scheme for a very small subset of Unicode/UCS, and seven-bit ASCII (when prefixed with 0 as the high-order bit) is valid UTF-8. Thus it follows that UTF-8 cannot collide with ASCII. But UTF-8 can and does collide with Extended-ASCII.
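To make the collision concrete, here is a minimal sketch of a UTF-8 encoder. The names encode_utf8 and check_utf8_examples are hypothetical helpers written for this illustration, not part of any library. It shows that ASCII code points encode to themselves, while U+00BF (the inverted question mark) becomes the two bytes C2 BF, so the single byte BF used by Latin-1 style encodings is not valid UTF-8 on its own.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8.
// Hypothetical helper for illustration; assumes a valid, non-surrogate code point.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 7-bit ASCII: one byte, unchanged
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // four bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage: ASCII is untouched, anything above 0x7F needs more than one byte.
void check_utf8_examples() {
    assert(encode_utf8(U'A') == "A");
    assert(encode_utf8(0x00BF) == "\xC2\xBF");      // inverted question mark
    assert(encode_utf8(0x2510) == "\xE2\x94\x90");  // box drawing character
}
```

A program that writes the lone byte BF is therefore producing Latin-1 or Windows-1252 text, not UTF-8.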
Extended ASCII means an eight-bit character encoding that includes (most of) the seven-bit ASCII characters, plus additional characters.
This eight-bit system increases the number of representable characters to 256. That is enough for the accented letters and symbols of one group of languages, but not for all languages at once, which is why many mutually incompatible extended-ASCII code pages exist.
A Windows console window is pure Unicode. Its buffer stores text as UCS-2 Unicode (16 bits per character, essentially like original Unicode, a restriction to the Basic Multilingual Plane of modern 21-bit Unicode). So a console window can present almost all kinds of text.
However, for single-byte-per-character i/o (and possibly also for some variable-length encodings), Windows automatically translates to/from the console window's active codepage. If the console window is a [cmd.exe] instance then you can inspect that via the command chcp, short for change codepage. Like this:
C:\test> chcp
Active code page: 850

C:\test> _
Codepage 850 is an encoding based on the original IBM PC English codepage 437. 850 is the default for console windows on at least Norwegian PCs (although savvy Norwegians may change that to 865). None of those are codepages that you should use, however.
The original IBM PC codepage (character encoding) is known as OEM, which is a meaningless acronym, Original Equipment Manufacturer. It had nice line drawing characters suitable for the original PC's text mode screen. More generally OEM means the default code page for console windows, where codepage 437 is just the original one: it can be configured, e.g. per window via chcp.
When Microsoft created 16-bit Windows they chose another encoding known in Windows as ANSI. The original one was an extension of ISO Latin-1 which for a long while was the default on the Internet (however, it's unclear which came first: Microsoft participated in the standardization). This original ANSI is now known as Windows ANSI Western.
ANSI is the code page used for non-Unicode by almost all the rest of Windows. Console windows use OEM. Notepad, other editors, and so on, use ANSI.
Then, when Microsoft made Windows 32-bit, they adopted a 16-bit extension of Latin-1 known as Unicode. Microsoft was a founding member of the Unicode Consortium. And the basic API, including console windows, the file system, etc., was rewritten to use Unicode. For backward compatibility there is a translation layer that translates between OEM and Unicode for console windows, and between ANSI and Unicode for other functionality. For example, MessageBoxA is an ANSI wrapper for the Unicode-based MessageBoxW.
The practical upshot of that is that in Windows your C++ source code is typically encoded with ANSI, while console windows assume OEM. Which e.g. makes
cout << "I like Norwegian blåbærsyltetøy!" << endl;
produce pure gobbledegook… You can use the Unicode-based console window APIs to output Unicode directly to a console window, avoiding the translation, but that's awkward.
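The mismatch comes down to the same letter having different byte values in the two code pages. As a rough sketch, the function below translates the three lowercase Norwegian letters from Windows-1252 (ANSI) bytes to their codepage 850 (OEM) bytes. The names ansi_to_oem850 and check_oem_examples are hypothetical, and the table is deliberately incomplete: a real program would need the full mapping, which Windows itself provides through functions such as CharToOemA.

```cpp
#include <cassert>
#include <string>

// Translate one Windows-1252 byte to the CP850 byte for the same character.
// Illustrative subset only: just the three lowercase Norwegian letters.
unsigned char ansi_to_oem850(unsigned char b) {
    switch (b) {
        case 0xE5: return 0x86;  // å
        case 0xE6: return 0x91;  // æ
        case 0xF8: return 0x9B;  // ø
        default:   return b;     // passed through unchanged (correct for ASCII only)
    }
}

// Translate a whole narrow string byte by byte.
std::string ansi_to_oem850(const std::string& ansi) {
    std::string oem;
    oem.reserve(ansi.size());
    for (unsigned char b : ansi) {
        oem += static_cast<char>(ansi_to_oem850(b));
    }
    return oem;
}

// Usage: "blåbærsyltetøy" as Windows-1252 bytes in, CP850 bytes out.
void check_oem_examples() {
    const std::string ansi = "bl\xE5" "b\xE6rsyltet\xF8y";
    const std::string oem  = "bl\x86" "b\x91rsyltet\x9By";
    assert(ansi_to_oem850(ansi) == oem);
}
```

Without such a translation, the console shows whatever codepage 850 assigns to bytes E5, E6 and F8, which are not å, æ and ø.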
Note that using wcout instead of cout doesn't help: by design wcout just translates down from wide character strings to the program's narrow character set, discarding information on the way. It can be hard to believe that the C++ standard library offers a rather big chunk of very complex functionality that is meaningless (since those conversions could just as well have been supported by cout). But so it is. Possibly it was some political compromise, but either way wcout does not help, even though, if it were meaningful in some way, it logically "should" help with this.
So how does a Norwegian novice programmer get e.g. "blåbærsyltetøy" presented?
Well, simply by changing the active code page to ANSI. Since on most Western country PCs ANSI is codepage 1252, you can do that for a given command interpreter instance by
C:\test> chcp 1252
Active code page: 1252

C:\test> _
Now old DOS programs like e.g. [edit.com] (still present in Windows XP!) will produce some gobbledegook, because the original PC character set line drawing characters are not there in ANSI, and because national characters have different codes in ANSI. But hey, who uses old DOS programs? Not me!
If you want this as a more permanent code page, you'll have to change the configuration of console windows via an undocumented registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
In this key, change the value of OEMCP to 1252, and reboot.
As with chcp, or any other change of codepage to 1252, this makes old DOS programs present gobbledegook, but makes C++ programs and other modern console programs work OK, since you then have the same character encoding in console windows as in the rest of Windows.
I'm running Windows 10 build 19043. Changing to the UTF-8 codepage (65001) allows printing/displaying Extended ASCII characters in the CMD window. Just type this line in your console or batch file and all should be good:
chcp 65001 1>nul
When you print an ASCII string, Windows internally converts it to UNICODE based on the current code page. There is also a translation from UNICODE to "ASCII" done by the CRT. The following would work.
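#include <fcntl.h>    // _O_U16TEXT
#include <io.h>       // _setmode
#include <stdio.h>    // _fileno, stdout
#include <iostream>

int main()
{
    // Switch stdout to UTF-16 text mode so wide output is not narrowed to the code page.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"\u00BF" << std::endl;  // inverted question mark
    return 0;
}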
Because the Win32 console uses code page 437 (aka the OEM font) to render characters, whereas most of the rest of Windows uses Windows-1252 for single-byte character codes.
The character "¿" is the Unicode character INVERTED QUESTION MARK, which has code point 0xBF (191 decimal) in Unicode, ISO 8859-1, and Windows-1252. The code point 0xBF in CP437 corresponds to the character "┐", which is BOX DRAWINGS LIGHT DOWN AND LEFT (code point U+2510).
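The lookup described above can be sketched in a few lines. The CodePage enum and the decode_byte function are hypothetical names for this illustration, and the CP437 table covers only the two bytes relevant to the question; any other high byte is reported as U+FFFD simply because this sketch has no entry for it.

```cpp
#include <cassert>

enum class CodePage { Cp437, Windows1252 };

// Return the Unicode code point that byte b stands for in the given code page.
// Deliberately tiny table: a sketch, not a complete decoder.
char32_t decode_byte(unsigned char b, CodePage cp) {
    if (b < 0x80) {
        return b;  // both code pages agree with ASCII here
    }
    if (cp == CodePage::Windows1252 && b >= 0xA0) {
        return b;  // 0xA0..0xFF match ISO 8859-1 and therefore Unicode
    }
    if (cp == CodePage::Cp437) {
        switch (b) {
            case 0xBF: return 0x2510;  // BOX DRAWINGS LIGHT DOWN AND LEFT
            case 0xA8: return 0x00BF;  // INVERTED QUESTION MARK
            default:   break;
        }
    }
    return 0xFFFD;  // not covered by this sketch
}

// Usage: the same byte 0xBF means two different characters.
void check_decode_examples() {
    assert(decode_byte(0xBF, CodePage::Windows1252) == 0x00BF);
    assert(decode_byte(0xBF, CodePage::Cp437) == 0x2510);
    assert(decode_byte(0xA8, CodePage::Cp437) == 0x00BF);
}
```

The last check shows that CP437 does contain the inverted question mark, only at byte 0xA8 rather than 0xBF.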
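As long as you write narrow (single-byte) strings to a Windows console that is left at CP437, you can display only the characters in CP437 and no others. If you want to display other Unicode characters, you'll need to change the console's code page, write wide characters, or use a different environment.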