I'm trying to create a custom print stream that can print localized messages to the console, and I've run into a problem doing this on Windows. Here is what I'm attempting to do:
In the code below I tried to follow the steps above, but it fails miserably. Strangely, the default System.out.println call works correctly. However, I want to use a custom print stream and not rely on the default System.out.
Can someone explain how I can print Unicode to the console using my custom print stream? And why is the default System.out already equipped to print things correctly?
Here is my code. I compiled it and ran it from the command line, having set my system locale to zh-CN beforehand.
    public static void main(String[] args) throws Exception {
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println(defaultCharset);
        // charset is windows-1252
        String unicodeMessage = "\u4e16\u754c\u4f60\u597d\uff01";
        System.out.println(unicodeMessage);
        // string is printed correctly using System.out (世界你好!)
        byte[] sourceBytes = unicodeMessage.getBytes("UTF-8");
        String data = new String(sourceBytes, defaultCharset.name());
        PrintStream out = new PrintStream(System.out, true, defaultCharset.name());
        out.println(data);
        // prints gibberish: ??–????????????
    }
The windows-1252 charset is the problem here. We need to use the UTF-8 charset to print. The following worked for me:
    public static void main(String[] args) throws Exception {
        Charset utf8Charset = Charset.forName("UTF-8");
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println(defaultCharset);
        // charset is windows-1252
        String unicodeMessage = "\u4e16\u754c\u4f60\u597d\uff01";
        System.out.println(unicodeMessage);
        // string is printed correctly using System.out (世界你好!)
        byte[] sourceBytes = unicodeMessage.getBytes("UTF-8");
        String data = new String(sourceBytes, defaultCharset.name());
        PrintStream out = new PrintStream(System.out, true, utf8Charset.name());
        out.println(data);
    }
You have a number of issues and misunderstandings. Firstly, after these two lines:
    byte[] sourceBytes = unicodeMessage.getBytes("UTF-8");
    String data = new String(sourceBytes, defaultCharset.name());
data is full of mojibake: you've decoded UTF-8 bytes as windows-1252. You then print this string through a UTF-8 encoder, and the resulting bytes go to System.out, whose console expects its own code page. It's broken at three levels.
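To make the first of those problems concrete, here is a minimal sketch of the round trip. The class name MojibakeDemo and the helper roundTrip are made up for illustration and are not part of the question's code. It encodes the message as UTF-8 and decodes the same bytes once as windows-1252 and once as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode as UTF-8, then decode the same bytes
    // with whatever charset the caller passes in.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        String message = "\u4e16\u754c\u4f60\u597d\uff01";
        Charset cp1252 = Charset.forName("windows-1252");

        // Decoding with the wrong charset turns every byte into its own character.
        String garbled = roundTrip(message, cp1252);
        // Decoding with the charset used for encoding restores the original.
        String intact = roundTrip(message, StandardCharsets.UTF_8);

        System.out.println(message.equals(garbled)); // false
        System.out.println(message.equals(intact));  // true
    }
}
```

The string is already correct before the round trip, so the getBytes / new String pair can only damage it. Any fix should pass unicodeMessage to the print stream directly.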
Now, the reason System.out.println(unicodeMessage); works is that you set your locale correctly. Java uses this (the code page of the console), not defaultCharset, to set up the console stream.
The problem you'll face is that the Windows console doesn't support UTF-8. You'll be OK printing characters from your code page, but not others. Find another solution, such as writing to a file or sending the results to a web page.
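If you still want a custom print stream, a rough sketch under those constraints might look like the following. The class EncodedPrint and its helpers wrap and encode are hypothetical names. The charset is taken from the first command-line argument because the right value depends on your console's code page, which this code does not detect; "GBK" for a Simplified Chinese console is only an assumption to try. The key point is that the original string is printed as-is and the stream does the encoding exactly once.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class EncodedPrint {
    // Hypothetical helper: wrap a byte sink in an auto-flushing PrintStream
    // that encodes text with the given charset.
    static PrintStream wrap(OutputStream sink, Charset charset) {
        try {
            return new PrintStream(sink, true, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Not expected: the name comes from an existing Charset instance.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: capture the bytes the wrapped stream would emit.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = wrap(buffer, charset);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: the caller passes the console's charset name, e.g. "GBK".
        Charset consoleCharset = args.length > 0
                ? Charset.forName(args[0])
                : Charset.defaultCharset();
        String message = "\u4e16\u754c\u4f60\u597d\uff01";

        // Print the original string; no getBytes / new String round trip.
        PrintStream out = wrap(System.out, consoleCharset);
        out.println(message);
    }
}
```

Characters outside the chosen charset are still replaced with a substitute byte (typically "?"), which matches the limitation described above. The encode helper is only there so the emitted bytes can be inspected without a console.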