Java, Unicode, UTF-8, and Windows Command Prompt

I have a jar file that is supposed to read a UTF-8 encoded file (which I wrote in a text editor under Windows) and display its characters on the screen. Under OS X and Linux this works flawlessly, but I'm having trouble getting it to work under Windows. I've defined a reader and writer like so:

FileInputStream file = new FileInputStream(args[0]);
InputStreamReader reader = new InputStreamReader(file, "UTF8");

PrintStream writer = new PrintStream(System.out, true, "UTF8");
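
For context, here is a minimal sketch of how a reader and writer like these might fit into a complete program. The class name Read is only guessed from Read.jar, and the copy helper and its buffer size are assumptions for illustration, not the original code.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Read {
    // Hypothetical helper: decode UTF-8 from `in` and re-encode it as UTF-8 onto `out`.
    static void copy(InputStream in, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        PrintStream writer = new PrintStream(out, true, "UTF-8");
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            // Print exactly the characters that were decoded in this pass.
            writer.print(new String(buf, 0, n));
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the UTF-8 encoded text file.
        try (FileInputStream file = new FileInputStream(args[0])) {
            copy(file, System.out);
        }
    }
}
```

A program shaped like this emits the same UTF-8 bytes it read, so the Java side is probably not where the text gets mangled.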

I've also changed the command prompt font to Lucida Console and the character encoding to UTF-8 with chcp 65001, in that order.

Now, when I run java -jar Read.jar file.txt, the prompt prints this:

áéí
ñóú
[]óú
[]

However, if I run type file.txt, the prompt correctly displays the file's contents.

áéí
ñóú

I've tried saving my file with and without a BOM, but that hasn't made a difference. (UTF-8 doesn't even need a BOM because of its lack of endianness, correct?) I've tried compiling with javac -encoding utf8 *.java, but the same thing happens.

I'm out of ideas now. Anyone care to help?

425nesp asked Aug 13 '12 02:08


1 Answer

Code page 65001 is broken. The MS C runtime stdio functions return inaccurate counts of bytes read and written when run under 65001, which leads to strange behaviours like this one.
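
To see how a wrong count could produce exactly the output in the question, here is a small simulation. It rests on two assumptions that are not confirmed by anything above: that the write call reports the number of characters shown rather than the number of bytes consumed, and that the caller retries whatever it believes was left unwritten. The class Cp65001Sketch and the method buggyWrite are hypothetical stand-ins, not real console APIs.

```java
import java.nio.charset.StandardCharsets;

public class Cp65001Sketch {
    // Hypothetical stand-in for a console write: "displays" the bytes, then
    // (assumed bug) reports decoded characters instead of bytes consumed.
    static int buggyWrite(byte[] buf, int off, int len, StringBuilder screen) {
        String shown = new String(buf, off, len, StandardCharsets.UTF_8);
        screen.append(shown);
        return shown.length();
    }

    // Mimics a caller that trusts the returned count and re-sends the rest.
    static String render(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder screen = new StringBuilder();
        int off = 0;
        while (off < bytes.length) {
            off += buggyWrite(bytes, off, bytes.length - off, screen);
        }
        return screen.toString();
    }

    public static void main(String[] args) {
        // Same two lines as file.txt, written as escapes to avoid source-encoding issues.
        String text = "\u00e1\u00e9\u00ed\n\u00f1\u00f3\u00fa\n";
        // Show the replacement character U+FFFD as '?' so the result is easy to read.
        System.out.print(render(text).replace('\uFFFD', '?'));
    }
}
```

The text is 14 bytes but only 8 characters, so the simulated caller re-sends from byte 8, which falls in the middle of ñ, and then again from byte 12, in the middle of ú. The result is the two correct lines followed by a broken character plus óú, and then a lone broken character, which is the same pattern as in the question.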

It's not fixable - you can't reliably use the Windows console for Unicode I/O from applications that use the C stdlib byte-I/O functions (which includes Java). You can hack it by calling the Win32 API function WriteConsoleW to get Unicode content directly to the Console, but then you have to worry about detecting when stdout actually is a console (not redirected to file).

This is a long-standing source of woe which MS shows no interest in fixing.

bobince answered Oct 16 '22 10:10