
Which encoding does Process.getInputStream() use?

In a Java program, I spawn a new Process via ProcessBuilder.

args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();

Then I read the process's standard output in a new Thread:

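new Thread() {
    @Override
    public void run() {
        // No charset is given here, so the reader decodes with the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}.start();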

However, when the process outputs non-ASCII characters (such as 'é'), line contains the character '\uFFFD' instead.

What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?

How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?

Edit: I tried new InputStreamReader(...,"UTF-8"): é becomes \uFFFD

rds asked Dec 06 '11 10:12


4 Answers

An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).

If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.

If you know what encoding to use (and you really have to know):

new InputStreamReader(process.getInputStream(), "UTF-8") // for example
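
To see why a guessed charset fails, here is a minimal sketch (the class and method names are illustrative, not from the question). It assumes the child process wrote 'é' as the single byte 0xE9, as ISO-8859-1 does; the real bytes depend on the program being run. Decoding that byte as UTF-8 yields the replacement character, which is consistent with what the question's edit reports.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    /** Decodes raw bytes with the given charset; malformed input becomes U+FFFD. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: the child process emitted e-acute as ISO-8859-1, i.e. the byte 0xE9.
        byte[] raw = "\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(raw, StandardCharsets.ISO_8859_1);
        String wrong = decode(raw, StandardCharsets.UTF_8);

        System.out.println("matching charset gives U+00E9: " + right.equals("\u00E9"));
        System.out.println("UTF-8 gives U+FFFD: " + wrong.contains("\uFFFD"));
    }
}
```

The reader can only recover 'é' if it is told the charset the writer actually used.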
Thilo answered Nov 11 '22 22:11


Interestingly enough, when running on Windows:

ProcessBuilder pb = new ProcessBuilder("cmd", "/c dir");
Process process = pb.start();

Then the CP437 code page works quite well for decoding the output:

new InputStreamReader(process.getInputStream(), "CP437");
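
Building on that observation, a minimal sketch of a reader that takes the charset by name and falls back to the JVM default when the name is not available. The class and helper names are hypothetical. Whether CP437 is the right code page depends on the machine's console settings, and not every Java runtime ships that charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class ProcessOutputReader {
    /** Returns the named charset if this JVM supports it, else the JVM default. */
    static Charset charsetOrDefault(String name) {
        return Charset.isSupported(name) ? Charset.forName(name) : Charset.defaultCharset();
    }

    /** Reads every line from the stream, decoding bytes with the given charset. */
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a Windows machine whose console uses code page 437.
        Process process = new ProcessBuilder("cmd", "/c", "dir").start();
        Charset charset = charsetOrDefault("CP437");
        readLines(process.getInputStream(), charset).forEach(System.out::println);
    }
}
```

Passing a Charset object rather than a string name also avoids the checked UnsupportedEncodingException of the string-based constructor.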
jan.supol answered Nov 11 '22 23:11


As I understand it, operating system streams are byte streams; there are no characters in them. The one-argument InputStreamReader constructor uses the JVM default character set, java.nio.charset.Charset#defaultCharset(). You can use another constructor to specify a character set explicitly.
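
As a small illustration of that point (the class and method names are illustrative), the sketch below prints the JVM default charset next to the encoding each constructor ends up with. The default depends on the JVM version and settings; since JDK 18 it is UTF-8 unless overridden, so it will not necessarily match what a child process writes. Note that getEncoding() may report a historical name such as UTF8 rather than the canonical UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    /** Encoding chosen by the one-argument constructor (the JVM default). */
    static String implicitEncoding() {
        try (InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            return reader.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encoding chosen when the charset is passed explicitly. */
    static String explicitEncoding(Charset charset) {
        try (InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]), charset)) {
            return reader.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset:  " + Charset.defaultCharset());
        System.out.println("implicit constructor: " + implicitEncoding());
        System.out.println("explicit UTF-8:       " + explicitEncoding(StandardCharsets.UTF_8));
    }
}
```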

kan answered Nov 11 '22 21:11


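According to http://www.fileformat.info/info/unicode/char/e9/index.htm, the Unicode code point for 'é' is '\u00E9', not '\uFFFD'. '\uFFFD' is the Unicode replacement character, which a decoder substitutes for bytes it cannot map. So if line really contains it, the bytes were decoded with the wrong charset. It is still worth ruling out a display problem when writing the output.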

The Windows console does not support Unicode by default. So, to test your code, open a file and write your stream there, and do not forget to set the encoding to UTF-8.
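
A minimal sketch of that test, assuming the lines have already been decoded into Java strings; the class name, helpers and temp-file prefix are made up for illustration. It writes the text to a file as UTF-8 and reads it back, which takes the console out of the picture.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class Utf8FileDump {
    /** Writes the lines to a temporary file as UTF-8, reads them back, and returns them. */
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("process-output", ".txt");
            try {
                Files.write(file, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a line read from the process; "\u00E9" is e-acute.
        List<String> lines = Arrays.asList("caf\u00E9");
        System.out.println("survived the round trip: " + roundTrip(lines).equals(lines));
    }
}
```

If the file shows the expected characters in a UTF-8 aware editor but the console does not, the remaining problem is display rather than decoding.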

AlexR answered Nov 11 '22 23:11