Different results reading file with Files.newBufferedReader() and constructing readers directly

It seems that Files.newBufferedReader() is more strict about UTF-8 than the naive alternative.

If I create a file containing the single byte 128 (which is not valid UTF-8 on its own), it is happily read when I construct a BufferedReader on an InputStreamReader on the result of Files.newInputStream(), but Files.newBufferedReader() throws an exception.

This code

try (
    InputStream in = Files.newInputStream(path);
    Reader isReader = new InputStreamReader(in, "UTF-8");
    Reader reader = new BufferedReader(isReader);
) {
    System.out.println((char) reader.read());
}

try (
    Reader reader = Files.newBufferedReader(path);
) {
    System.out.println((char) reader.read());
}

has this result:

�
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.read(BufferedReader.java:182)
    at TestUtf8.main(TestUtf8.java:28)

Is this documented? And is it possible to get the lenient behavior with Files.newBufferedReader()?

Robert Tupelo-Schneck asked Jan 19 '16 20:01


1 Answer

The difference is in how the CharsetDecoder used to decode the UTF-8 is constructed in the two cases.

For new InputStreamReader(in, "UTF-8") the decoder is constructed using:

Charset cs = Charset.forName("UTF-8");

CharsetDecoder decoder = cs.newDecoder()
          .onMalformedInput(CodingErrorAction.REPLACE)
          .onUnmappableCharacter(CodingErrorAction.REPLACE);

This explicitly specifies that malformed or unmappable sequences are replaced with the decoder's replacement character (U+FFFD for UTF-8) instead of raising an error.

Files.newBufferedReader(path) uses:

Charset cs = StandardCharsets.UTF_8;

CharsetDecoder decoder = cs.newDecoder();

In this case onMalformedInput and onUnmappableCharacter are never called, so the decoder keeps its default action, CodingErrorAction.REPORT, which causes the MalformedInputException you are seeing.
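To see the two decoder configurations side by side without involving any file, here is a minimal sketch that decodes the single byte 128 directly. The class name DecoderActions and the helper names decodesStrictly and decodeLeniently are made up for illustration; only the java.nio.charset calls are real API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the JDK or of the original post.
public class DecoderActions {

    // Decoder with default actions (REPORT), as Files.newBufferedReader(path) uses.
    // Returns false when the bytes are rejected as malformed or unmappable.
    static boolean decodesStrictly(byte[] bytes) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder();
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decoder configured with REPLACE, matching what InputStreamReader sets up
    // when it is given a charset name.
    static String decodeLeniently(byte[] bytes) {
        CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return lenient.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE; rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = { (byte) 128 };
        System.out.println(decodesStrictly(bad));                   // false
        System.out.println(decodeLeniently(bad).equals("\uFFFD"));  // true
    }
}
```

The only difference between the two helpers is the pair of onMalformedInput / onUnmappableCharacter calls, which is the same difference that separates the two readers in the question.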

There does not seem to be a way to change the decoder that Files.newBufferedReader uses, and I didn't see this difference documented while looking through the code.
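If you want the lenient behavior together with the Files API, one option is to build the reader yourself and pass in a decoder configured with REPLACE. This is only a sketch: the class LenientReaders and the method newLenientBufferedReader are hypothetical names, and the temporary file in main exists only to reproduce the single-byte case from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; not part of the JDK.
public class LenientReaders {

    // Like Files.newBufferedReader(path), but malformed or unmappable input
    // is replaced with U+FFFD instead of throwing MalformedInputException.
    static BufferedReader newLenientBufferedReader(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), decoder));
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("lenient", ".bin");
        try {
            // Same input as in the question: one byte with value 128.
            Files.write(path, new byte[] { (byte) 128 });
            try (BufferedReader reader = newLenientBufferedReader(path)) {
                System.out.println(reader.read() == 0xFFFD); // true
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Closing the BufferedReader also closes the underlying InputStreamReader and input stream, so a single try-with-resources on the returned reader is enough.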

greg-449 answered Sep 22 '22 00:09