Printing unicode to console

I'm trying to create a custom print stream that can print localized messages to the console, and I've run into a problem doing this on Windows. Here is what I'm attempting to do:

  • I have a unicode string
  • Convert unicode string to bytes using UTF-8 encoding
  • Convert bytes to a new string with console encoding
  • Print new string to console with console encoding

The code below attempts these steps, but it fails miserably. Strangely, the default System.out.println call works correctly. However, I want to use a custom print stream and not rely on the default System.out.

Can someone explain how I can print unicode to the console using my custom print stream? And why is the default System.out already equipped to print things correctly?

Here is my code. I compiled it and ran it from the command line, after setting my system locale to zh-CN.

public static void main(String[] args) throws Exception{
    Charset defaultCharset = Charset.defaultCharset();
    System.out.println(defaultCharset);
    // charset is windows-1252

    String unicodeMessage =
            "\u4e16\u754c\u4f60\u597d\uff01";

    System.out.println(unicodeMessage);
    // string is printed correctly using System.out (世界你好!)


    byte[] sourceBytes = unicodeMessage.getBytes("UTF-8");
    String data = new String(sourceBytes , defaultCharset.name());

    PrintStream out = new PrintStream(System.out, true, defaultCharset.name());
    out.println(data);
    // prints gibberish: ??–????????????
}
HAL asked Dec 18 '15 17:12


2 Answers

The windows-1252 charset is the problem here. The print stream needs to use the UTF-8 charset instead. The following worked for me:

public static void main(String[] args) throws Exception{
    Charset utf8Charset = Charset.forName("UTF-8");
    Charset defaultCharset = Charset.defaultCharset();
    System.out.println(defaultCharset);
    // charset is windows-1252

    String unicodeMessage = "\u4e16\u754c\u4f60\u597d\uff01";

    System.out.println(unicodeMessage);
    // string is printed correctly using System.out (世界你好!)


    byte[] sourceBytes = unicodeMessage.getBytes("UTF-8");
    String data = new String(sourceBytes , defaultCharset.name());

    PrintStream out = new PrintStream(System.out, true, utf8Charset.name());
    out.println(data);
}
Darshan Mehta answered Oct 21 '22 06:10


You have a number of issues and misunderstandings. Firstly,

byte[] sourceBytes = unicodeMessage.getBytes("UTF-8");
String data = new String(sourceBytes , defaultCharset.name());

data is now full of mojibake: you've decoded UTF-8 bytes as windows-1252. You then print this string through a UTF-8 encoder. System.out then encodes it for your console's codepage. It's got three levels of broken.
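
To see the first problem in isolation, here is a minimal sketch that decodes the UTF-8 bytes of the message as windows-1252. The class and method names are illustrative and not from the question. Each of the 15 UTF-8 bytes becomes its own unrelated character, which is the mojibake described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes `text` as UTF-8, then decodes those bytes with the wrong charset,
    // reproducing the round trip from the question.
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        String original = "\u4e16\u754c\u4f60\u597d\uff01";
        String garbled = misdecode(original, Charset.forName("windows-1252"));

        System.out.println(original.length()); // 5 chars
        System.out.println(garbled.length());  // 15 chars, one per UTF-8 byte

        // Print code points in hex so the result does not depend on the console.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The string is already damaged at this point, before any PrintStream is involved, so no choice of output encoding can print it correctly.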

Now, the reason System.out.println(unicodeMessage); works is that you set your locale correctly. Java uses this (the codepage of the console), not defaultCharset, to set up the console.
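
If that is the case, a custom stream could encode with the console's charset rather than Charset.defaultCharset(), and print the original string without any manual byte conversion. The following is only a sketch under several assumptions: it reads the stdout.encoding system property (documented from JDK 19) and the unsupported sun.stdout.encoding property found on some older JDKs, falls back to the default charset when neither is set, and needs Java 10+ for the PrintStream constructor that takes a Charset. ConsolePrinter, consoleCharset and printerFor are hypothetical names.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;

class ConsolePrinter {
    // Best-effort guess at the console charset. Assumption: one of these
    // properties is set by the JDK; otherwise fall back to the default charset.
    public static Charset consoleCharset() {
        for (String key : new String[] {"stdout.encoding", "sun.stdout.encoding"}) {
            String name = System.getProperty(key);
            if (name == null || name.isEmpty()) {
                continue;
            }
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: try the next property.
            }
        }
        return Charset.defaultCharset();
    }

    // Wraps a byte sink in an auto-flushing PrintStream that encodes with `charset`.
    public static PrintStream printerFor(OutputStream sink, Charset charset) {
        return new PrintStream(sink, true, charset);
    }

    public static void main(String[] args) {
        // Write to the raw stdout descriptor so the text is encoded exactly once.
        PrintStream out = printerFor(new FileOutputStream(FileDescriptor.out), consoleCharset());
        out.println("\u4e16\u754c\u4f60\u597d\uff01"); // pass the original string, not re-decoded bytes
    }
}
```

Wrapping the raw file descriptor instead of System.out keeps the custom stream independent of the default one. Characters outside the console's codepage will still be replaced by the encoder.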

The problem you'll face is that the Windows console doesn't support UTF-8. You'll be OK printing characters from your codepage, but not others. Find another solution, such as writing to a file or sending the results to a web page.
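
As one possible shape for the file-based alternative, here is a small sketch that writes the message to a file as UTF-8 and reads it back. The class name, method names and the file name message.txt are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Utf8FileOutput {
    // Writes `message` to `target` as UTF-8, replacing any existing content.
    public static void writeUtf8(Path target, String message) {
        try {
            Files.write(target, message.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file back, decoding it as UTF-8.
    public static String readUtf8(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path target = Paths.get("message.txt"); // hypothetical output file
        writeUtf8(target, "\u4e16\u754c\u4f60\u597d\uff01\n");
        System.out.println(readUtf8(target).length()); // 6 chars including the newline
    }
}
```

Because the charset is named explicitly on both sides, the result does not depend on the system locale or the console codepage.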

Alastair McCormack answered Oct 21 '22 06:10