How does Java determine the encoding used for System.out?
Given the following class:
import java.io.File;
import java.io.PrintWriter;

public class Foo {
    public static void main(String[] args) throws Exception {
        String s = "xxäñxx";
        System.out.println(s);
        PrintWriter out = new PrintWriter(new File("test.txt"), "UTF-8");
        out.println(s);
        out.close();
    }
}
It is saved as UTF-8 and compiled with javac -encoding UTF-8 Foo.java
on a Windows system.
Afterwards on a git-bash console (using UTF-8 charset) I do:
$ java Foo
xxõ±xx
$ java -Dfile.encoding=UTF-8 Foo
xxäñxx
$ cat test.txt
xxäñxx
$ java Foo | cat
xxäñxx
$ java -Dfile.encoding=UTF-8 Foo | cat
xxäñxx
What is going on here?
Apparently Java checks whether stdout is connected to a terminal and changes its encoding in that case. Is there a way to force Java to simply output plain UTF-8?
I tried the same with the cmd console, too. Redirecting STDOUT does not seem to make any difference there: without the file.encoding parameter it outputs the ANSI encoding; with the parameter it outputs UTF-8.
The native character encoding of the Java programming language is UTF-16.
Java supports a wide array of encodings and conversions between them. The class Charset defines a set of standard charsets which every implementation of the Java platform is required to support. This includes US-ASCII, ISO-8859-1, UTF-8, and UTF-16, to name a few.
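As a sketch (the class and variable names here are illustrative, not from the original post), these guaranteed charsets can be used to convert text between encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // Every JVM must support this name, so forName cannot fail here.
        Charset latin1 = Charset.forName("ISO-8859-1");
        // Encode the same text with two different charsets:
        // "ä" and "ñ" each take 2 bytes in UTF-8 but 1 byte in ISO-8859-1.
        byte[] utf8 = "xxäñxx".getBytes(StandardCharsets.UTF_8);
        byte[] iso  = "xxäñxx".getBytes(latin1);
        System.out.println(utf8.length + " " + iso.length); // prints 8 6
    }
}
```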
When the JVM is started through scripts or tools, the default charset can be set via the environment variable JAVA_TOOL_OPTIONS, e.g. -Dfile.encoding=UTF-16 (or any other charset), which is then picked up by every JVM that starts on the machine.
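Whatever default the JVM ends up with can be inspected at runtime; a minimal sketch:

```java
import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // Reflects the charset resolved at JVM startup
        // (influenced by the platform, -Dfile.encoding, JAVA_TOOL_OPTIONS, ...)
        System.out.println(Charset.defaultCharset().name());
        System.out.println(System.getProperty("file.encoding"));
    }
}
```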
UTF-8 is a variable-width character encoding. For ASCII text it is as compact as ASCII itself, yet it can represent any Unicode character at the cost of some extra bytes. UTF stands for Unicode Transformation Format; the '8' signifies that it uses 8-bit code units.
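The variable width is easy to observe by encoding single characters (a small illustrative example):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    public static void main(String[] args) {
        // ASCII characters take 1 byte in UTF-8; other characters take 2-4.
        System.out.println("x".getBytes(StandardCharsets.UTF_8).length); // 1
        System.out.println("ä".getBytes(StandardCharsets.UTF_8).length); // 2
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```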
I'm assuming that your console still runs under cmd.exe. I doubt your console is really expecting UTF-8 - I expect it is really an OEM DOS encoding (e.g. 850 or 437.)
Java will encode bytes using the default encoding set during JVM initialization.
Reproducing on my PC:
$ java Foo
Java encodes as windows-1252; the console decodes as IBM850. Result: mojibake.
$ java -Dfile.encoding=UTF-8 Foo
Java encodes as UTF-8; the console decodes as IBM850. Result: mojibake.
$ cat test.txt
cat decodes the file as UTF-8; cat re-encodes as IBM850; the console decodes as IBM850. Result: correct output.
$ java Foo | cat
Java encodes as windows-1252; cat decodes as windows-1252; cat re-encodes as IBM850; the console decodes as IBM850. Result: correct output.
$ java -Dfile.encoding=UTF-8 Foo | cat
Java encodes as UTF-8; cat decodes as UTF-8; cat re-encodes as IBM850; the console decodes as IBM850. Result: correct output.
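The first case can be reproduced in pure Java by encoding with one charset and decoding with the other (the class name Mojibake is just illustrative; both charset names are supported by the JDK):

```java
import java.nio.charset.Charset;

public class Mojibake {
    public static void main(String[] args) {
        String s = "xxäñxx";
        // Encode with the charset Java used (windows-1252), then decode with
        // the charset the console assumed (IBM850):
        // ä (0xE4) -> õ, ñ (0xF1) -> ±
        byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
        String garbled = new String(bytes, Charset.forName("IBM850"));
        System.out.println(garbled); // prints xxõ±xx
    }
}
```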
This implementation of cat must use heuristics to determine if the character data is UTF-8 or not, then transcodes the data from either UTF-8 or ANSI (e.g. windows-1252) to the console encoding (e.g. IBM850.)
This can be confirmed with the following commands:
$ java HexDump utf8.txt
78 78 c3 a4 c3 b1 78 78
$ cat utf8.txt
xxäñxx
$ java HexDump ansi.txt
78 78 e4 f1 78 78
$ cat ansi.txt
xxäñxx
The cat command can make this determination because e4 f1
is not a valid UTF-8 sequence.
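A sketch of how such a validity check can be done in Java, using a CharsetDecoder configured to reject malformed input (the class Utf8Check and its helper are illustrative, not how any particular cat implements it):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    static boolean isValidUtf8(byte[] data) {
        // By default a decoder replaces bad input; REPORT makes it throw instead.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {0x78, 0x78, (byte) 0xc3, (byte) 0xa4,
                       (byte) 0xc3, (byte) 0xb1, 0x78, 0x78}; // xxäñxx in UTF-8
        byte[] ansi = {0x78, 0x78, (byte) 0xe4, (byte) 0xf1, 0x78, 0x78};
        System.out.println(isValidUtf8(utf8)); // true
        System.out.println(isValidUtf8(ansi)); // false: e4 f1 is malformed UTF-8
    }
}
```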
You can correct the Java output by making the two encodings agree: for example, set -Dfile.encoding to the console's code page (java -Dfile.encoding=IBM850 Foo), or switch the console to UTF-8 first (chcp 65001 in cmd) and run with -Dfile.encoding=UTF-8.
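Alternatively (not shown in the original post, but a common workaround using standard APIs), you can bypass the default charset entirely by replacing System.out with a PrintStream that always encodes UTF-8:

```java
import java.io.PrintStream;

public class ForceUtf8 {
    public static void main(String[] args) throws Exception {
        // Wrap stdout in a PrintStream that always emits UTF-8 bytes,
        // regardless of the default charset picked at JVM startup.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println("xxäñxx");
    }
}
```

The console must of course be set to decode UTF-8 (e.g. chcp 65001) for this to display correctly.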
HexDump is a trivial Java application:
import java.io.*;

class HexDump {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            int r;
            while ((r = in.read()) != -1) {
                System.out.format("%02x ", 0xFF & r);
            }
            System.out.println();
        }
    }
}