Unicode input in a console application in Java

I am trying to read Unicode user input in a small Java console utility. It works out of the box on Ubuntu, which I assume uses UTF-8 as its OS-wide encoding, but it fails on Windows when run from cmd. The code in question is as follows:

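import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.util.Arrays;

public class SerTest {

    public static void main(String[] args) throws Exception {
        testUnicode();
    }

    public static void testUnicode() throws Exception {
        System.out.println("Default charset: " +
           Charset.defaultCharset().name());
        // Decode stdin explicitly as UTF-8, regardless of the default charset
        BufferedReader in  =
           new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        System.out.printf("Enter 'абвгд эюя': ");
        String line = in.readLine(); // null if the stream reports end of input
        String s = "абвгд эюя";
        byte[] sBytes = s.getBytes(); // encoded with the default charset
        System.out.println("strg bytes: " + Arrays.toString(sBytes));
        byte[] lineBytes = line.getBytes(); // encoded with the default charset
        System.out.println("line bytes: " + Arrays.toString(lineBytes));
        // Encode output explicitly as UTF-8
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print("--->" + s + "<----\n");
        out.print("--->" + line + "<----\n");
    }

}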

Output on Ubuntu (without any changes to configuration):

me@host> javac SerTest.java  && java SerTest
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

Output at the Windows cmd prompt (the result is the same with or without JAVA_TOOL_OPTIONS):

E:\>chcp 65001
Active code page: 65001

E:\>java -Dfile.encoding=utf8 SerTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
Default charset: UTF-8
Enter 'абвгд эюя': юя': ': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Exception in thread "main" java.lang.NullPointerException
        at SerTest.testUnicode(SerTest.java:26) # byte[] lineBytes = line.getBytes();
        at SerTest.main(SerTest.java:15)

Output in Eclipse console (after using JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

In the Eclipse console it works because I added a system-wide environment variable (JAVA_TOOL_OPTIONS), which I would like to avoid if possible.

Output in Eclipse console (after removing JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-61, -112, -62, -80, -61, -112, -62, -79, -61, -112, -62, -78, -61, -112, -62, -77, -61, -112, -62, -76, 32, -61, -111, -17, -65, -67, -61, -111, -59, -67, -61, -111, -17, -65, -67]
--->абвгд эюя<----
--->абвгд �ю�<----

So my question is: what exactly is going on here? What code changes would be required to ensure that this snippet works for all sorts of "Unicode" input?

Sorry for the long-winded question, and thanks in advance,
Sasuke

sasuke asked Dec 29 '11 14:12


2 Answers

Some notes:

  • -Dfile.encoding=utf8 is not supported and may cause unintended side-effects:

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

  • The Console class will detect and use the terminal encoding, but it doesn't support code page 65001 (UTF-8) on Windows; at least, it didn't the last time I tried it.
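
As a rough sketch of the Console route from the second note: the code below uses System.console() when one is available and otherwise falls back to reading System.in with an explicitly named charset instead of relying on file.encoding. Whether Console copes with a given Windows code page is exactly the open point above, so treat this as something to try, not as a fix. ConsoleInputSketch and readLineFrom are names invented for this example.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsoleInputSketch {

    /** Reads one line from a stream using an explicitly named charset; null at end of stream. */
    static String readLineFrom(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // System.console() may be null, e.g. when the JVM is not attached to a terminal.
        Console console = System.console();
        String line = (console != null)
                ? console.readLine("Enter text: ")
                : readLineFrom(System.in, StandardCharsets.UTF_8); // assumption: piped input is UTF-8
        if (line == null) {
            System.err.println("No input (end of stream).");
            return;
        }
        System.out.println("Read " + line.length() + " UTF-16 code units");
    }
}
```

Passing the charset as a parameter keeps the decoding decision visible at the call site, so nothing depends on the JVM's default charset.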

I believe that the correct, documented way to use Unicode with cmd.exe is to use WriteConsoleW and ReadConsoleW.

I wrote a couple of blog posts when I was looking at this:

  • I18N: Unicode at the Windows command prompt
  • Java: Unicode on the Windows command line

McDowell answered Nov 12 '22 04:11


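The NPE is thrown at line.getBytes() (see the stack trace), which means that line is null. Arrays.toString(lineBytes) is not the culprit: it accepts a null array and simply returns "null".

line holds the value of in.readLine(), and BufferedReader.readLine() returns null only when the end of the stream has been reached. So on Windows the console input reached Java as end of stream instead of as a line of text.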
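
To check where a null can come from in the question's code, here is a small self-contained sketch. It simulates an immediate end of input with an empty byte array, shows that BufferedReader.readLine() then returns null, and guards the getBytes() call accordingly. NullLineSketch, firstLine and describe are illustrative names, not part of the original program.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NullLineSketch {

    /** Reads the first line of the given bytes as UTF-8; returns null at end of stream. */
    static String firstLine(byte[] input) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(input), StandardCharsets.UTF_8));
        return in.readLine();
    }

    /** Null-safe version of the step that failed in the question. */
    static String describe(String line) {
        if (line == null) {
            return "no input (end of stream)";
        }
        return Arrays.toString(line.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String line = firstLine(new byte[0]); // an empty stream stands in for immediate EOF
        System.out.println("readLine returned null: " + (line == null));
        // Arrays.toString tolerates a null array, so it cannot be the source of the NPE.
        System.out.println("Arrays.toString(null): " + Arrays.toString((byte[]) null));
        System.out.println("line bytes: " + describe(line));
    }
}
```

In the original snippet the same guard would go right after in.readLine(), before any call on line.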

It happens on Windows because the Windows command prompt does not support Unicode by default. It works on Ubuntu because its terminal is fully Unicode-enabled. It works partially in Eclipse because Eclipse's console window is a Java component that supports Unicode for input, and handles it for output once JAVA_TOOL_OPTIONS is set.
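
The garbled "line bytes" from the Eclipse run without JAVA_TOOL_OPTIONS look like a double-encoding pattern. The sketch below is a guess at the mechanism, not something the question confirms: it assumes the UTF-8 bytes of the input were decoded as windows-1252 somewhere along the way and the resulting text was then encoded as UTF-8 again. DoubleEncodingSketch and doubleEncode are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DoubleEncodingSketch {

    /** Encodes text as UTF-8, misreads those bytes as windows-1252, then encodes as UTF-8 again. */
    static byte[] doubleEncode(String text) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed charset of the misreading step
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Bytes that have no windows-1252 mapping (such as 0x8D and 0x8F) decode to U+FFFD.
        String misread = new String(utf8, cp1252);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The question's test string, written with Unicode escapes so that
        // the encoding of this source file does not matter.
        String s = "\u0430\u0431\u0432\u0433\u0434 \u044D\u044E\u044F";
        System.out.println("utf-8 bytes:    " + Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("double-encoded: " + Arrays.toString(doubleEncode(s)));
    }
}
```

Under that assumption the double-encoded array can be compared directly with the "line bytes" printed in the question; the U+FFFD replacements would also account for the two characters that do not survive the round trip.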

The bottom line is that you need to configure the Windows command prompt to handle Unicode characters. I have seen several discussions on this topic. Please take a look at this one: Unicode characters in Windows command line - how?

I hope this will help you.

AlexR answered Nov 12 '22 02:11