When I open cmd.exe in Windows, what encoding is it using?
How can I check which encoding it is currently using? Does it depend on my regional setting or are there any environment variables to check?
What happens when you type a file with a certain encoding? Sometimes I get garbled characters (incorrect encoding used) and sometimes it kind of works. However I don't trust anything as long as I don't know what's going on. Can anyone explain?
They use almost entirely C, C++, and C# for Windows. Some areas of code are hand tuned/hand written assembly.
Syntax CHCP code_page Key code_page A code page number (e.g. 437) This command is rarely required as most GUI programs and PowerShell now support Unicode. When working with characters outside the ASCII range of 0-127, such as some box characters, the choice of code page will determine the set of characters displayed.
The character set most commonly used in computers today is Unicode, a global standard for character encoding. Internally, Windows applications use the UTF-16 implementation of Unicode.
Yes, it’s frustrating—sometimes type
and other programs print gibberish, and sometimes they do not.
First of all, Unicode characters will only display if the current console font contains the characters. So use a TrueType font like Lucida Console instead of the default Raster Font.
But if the console font doesn’t contain the character you’re trying to display, you’ll see question marks instead of gibberish. When you get gibberish, there’s more going on than just font settings.
When programs use standard C-library I/O functions like printf
, the program’s output encoding must match the console’s output encoding, or you will get gibberish. chcp
shows and sets the current codepage. All output using standard C-library I/O functions is treated as if it is in the codepage displayed by chcp
.
Matching the program’s output encoding with the console’s output encoding can be accomplished in two different ways:
A program can get the console’s current codepage using chcp
or GetConsoleOutputCP
, and configure itself to output in that encoding, or
You or a program can set the console’s current codepage using chcp
or SetConsoleOutputCP
to match the default output encoding of the program.
However, programs that use Win32 APIs can write UTF-16LE strings directly to the console with WriteConsoleW
. This is the only way to get correct output without setting codepages. And even when using that function, if a string is not in the UTF-16LE encoding to begin with, a Win32 program must pass the correct codepage to MultiByteToWideChar
. Also, WriteConsoleW
will not work if the program’s output is redirected; more fiddling is needed in that case.
type
works some of the time because it checks the start of each file for a UTF-16LE Byte Order Mark (BOM), i.e. the bytes 0xFF 0xFE
. If it finds such a mark, it displays the Unicode characters in the file using WriteConsoleW
regardless of the current codepage. But when type
ing any file without a UTF-16LE BOM, or for using non-ASCII characters with any command that doesn’t call WriteConsoleW
—you will need to set the console codepage and program output encoding to match each other.
How can we find this out?
Here’s a test file containing Unicode characters:
ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好
Here’s a Java program to print out the test file in a bunch of different Unicode encodings. It could be in any programming language; it only prints ASCII characters or encoded bytes to stdout
.
import java.io.*; public class Foo { private static final String BOM = "\ufeff"; private static final String TEST_STRING = "ASCII abcde xyz\n" + "German äöü ÄÖÜ ß\n" + "Polish ąęźżńł\n" + "Russian абвгдеж эюя\n" + "CJK 你好\n"; public static void main(String[] args) throws Exception { String[] encodings = new String[] { "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" }; for (String encoding: encodings) { System.out.println("== " + encoding); for (boolean writeBom: new Boolean[] {false, true}) { System.out.println(writeBom ? "= bom" : "= no bom"); String output = (writeBom ? BOM : "") + TEST_STRING; byte[] bytes = output.getBytes(encoding); System.out.write(bytes); FileOutputStream out = new FileOutputStream("uc-test-" + encoding + (writeBom ? "-bom.txt" : "-nobom.txt")); out.write(bytes); out.close(); } } } }
The output in the default codepage? Total garbage!
Z:\andrew\projects\sx\1259084>chcp Active code page: 850 Z:\andrew\projects\sx\1259084>java Foo == UTF-8 = no bom ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ = bom ´╗┐ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ == UTF-16LE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y = bom ■A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y == UTF-16BE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} = bom ■ A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} == UTF-32LE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺ R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N ♦ O♦ C J K `O }Y = bom ■ A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺ R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N ♦ O♦ C J K `O }Y == UTF-32BE = no bom A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} = bom ■ A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y}
However, what if we type
the files that got saved? They contain the exact same bytes that were printed to the console.
Z:\andrew\projects\sx\1259084>type *.txt uc-test-UTF-16BE-bom.txt ■ A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} uc-test-UTF-16BE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣☺↓☺z☺|☺D☺B R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O C J K O`Y} uc-test-UTF-16LE-bom.txt ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 uc-test-UTF-16LE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y uc-test-UTF-32BE-bom.txt ■ A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} uc-test-UTF-32BE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N ♦O C J K O` Y} uc-test-UTF-32LE-bom.txt A S C I I a b c d e x y z G e r m a n ä ö ü Ä Ö Ü ß P o l i s h ą ę ź ż ń ł R u s s i a n а б в г д е ж э ю я C J K 你 好 uc-test-UTF-32LE-nobom.txt A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺ R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N ♦ O♦ C J K `O }Y uc-test-UTF-8-bom.txt ´╗┐ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢ uc-test-UTF-8-nobom.txt ASCII abcde xyz German ├ñ├Â├╝ ├ä├û├£ ├ƒ Polish ─à─Ö┼║┼╝┼ä┼é Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ CJK õ¢áÕÑ¢
The only thing that works is UTF-16LE file, with a BOM, printed to the console via type
.
If we use anything other than type
to print the file, we get garbage:
Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON ■A S C I I a b c d e x y z G e r m a n õ ÷ ³ ─ Í ▄ ▀ P o l i s h ♣☺↓☺z☺|☺D☺B☺ R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦ C J K `O}Y 1 file(s) copied.
From the fact that copy CON
does not display Unicode correctly, we can conclude that the type
command has logic to detect a UTF-16LE BOM at the start of the file, and use special Windows APIs to print it.
We can see this by opening cmd.exe
in a debugger when it goes to type
out a file:
After type
opens a file, it checks for a BOM of 0xFEFF
—i.e., the bytes 0xFF 0xFE
in little-endian—and if there is such a BOM, type
sets an internal fOutputUnicode
flag. This flag is checked later to decide whether to call WriteConsoleW
.
But that’s the only way to get type
to output Unicode, and only for files that have BOMs and are in UTF-16LE. For all other files, and for programs that don’t have special code to handle console output, your files will be interpreted according to the current codepage, and will likely show up as gibberish.
You can emulate how type
outputs Unicode to the console in your own programs like so:
#include <stdio.h> #define UNICODE #include <windows.h> static LPCSTR lpcsTest = "ASCII abcde xyz\n" "German äöü ÄÖÜ ß\n" "Polish ąęźżńł\n" "Russian абвгдеж эюя\n" "CJK 你好\n"; int main() { int n; wchar_t buf[1024]; HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE); n = MultiByteToWideChar(CP_UTF8, 0, lpcsTest, strlen(lpcsTest), buf, sizeof(buf)); WriteConsole(hConsole, buf, n, &n, NULL); return 0; }
This program works for printing Unicode on the Windows console using the default codepage.
For the sample Java program, we can get a little bit of correct output by setting the codepage manually, though the output gets messed up in weird ways:
Z:\andrew\projects\sx\1259084>chcp 65001 Active code page: 65001 Z:\andrew\projects\sx\1259084>java Foo == UTF-8 = no bom ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 ж эюя CJK 你好 你好 好 � = bom ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好 еж эюя CJK 你好 你好 好 � == UTF-16LE = no bom A S C I I a b c d e x y z …
However, a C program that sets a Unicode UTF-8 codepage:
#include <stdio.h> #include <windows.h> int main() { int c, n; UINT oldCodePage; char buf[1024]; oldCodePage = GetConsoleOutputCP(); if (!SetConsoleOutputCP(65001)) { printf("error\n"); } freopen("uc-test-UTF-8-nobom.txt", "rb", stdin); n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin); fwrite(buf, sizeof(buf[0]), n, stdout); SetConsoleOutputCP(oldCodePage); return 0; }
does have correct output:
Z:\andrew\projects\sx\1259084>.\test ASCII abcde xyz German äöü ÄÖÜ ß Polish ąęźżńł Russian абвгдеж эюя CJK 你好
The moral of the story?
type
can print UTF-16LE files with a BOM regardless of your current codepageWriteConsoleW
.chcp
, and will probably still get weird output.Type
chcp
to see your current code page (as Dewfy already said).
Use
nlsinfo
to see all installed code pages and find out what your code page number means.
You need to have Windows Server 2003 Resource kit installed (works on Windows XP) to use nlsinfo
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With