Path file = Paths.get("New Text Document.txt");
try {
    System.out.println(Files.readString(file, StandardCharsets.UTF_8));
    System.out.println(Files.readString(file, StandardCharsets.UTF_16));
} catch (Exception e) {
    System.out.println("yep it's an exception");
}
might yield
some text
Exception in thread "main" java.lang.Error: java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.lang.String.decodeWithDecoder(String.java:1212)
at java.base/java.lang.String.newStringNoRepl1(String.java:786)
at java.base/java.lang.String.newStringNoRepl(String.java:738)
at java.base/java.lang.System$2.newStringNoRepl(System.java:2390)
at java.base/java.nio.file.Files.readString(Files.java:3369)
at test.Test2.main(Test2.java:13)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.lang.String.decodeWithDecoder(String.java:1205)
... 5 more
This error "shouldn't happen". Here's the java.lang.String method:
private static int decodeWithDecoder(CharsetDecoder cd, char[] dst, byte[] src, int offset, int length) {
    ByteBuffer bb = ByteBuffer.wrap(src, offset, length);
    CharBuffer cb = CharBuffer.wrap(dst, 0, dst.length);
    try {
        CoderResult cr = cd.decode(bb, cb, true);
        if (!cr.isUnderflow())
            cr.throwException();
        cr = cd.flush(cb);
        if (!cr.isUnderflow())
            cr.throwException();
    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
        throw new Error(x);
    }
    return cb.position();
}
EDIT: As @user16320675 noted, this happens when a UTF-8 file with an odd number of bytes (for ASCII-only content, an odd number of characters) is read as UTF-16. With an even number, neither the Error nor the MalformedInputException occurs. Why the Error?
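The odd/even pattern can be reproduced without a file. The following is a minimal sketch, assuming ASCII-only content and using the public CharsetDecoder API, whose default error action is to report malformed input. The names OddLengthDemo and decodesAsUtf16 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class OddLengthDemo {
    /** Returns true if the bytes decode as UTF-16 without a reported coding error. */
    static boolean decodesAsUtf16(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by default (CodingErrorAction.REPORT).
            StandardCharsets.UTF_16.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] even = "abcd".getBytes(StandardCharsets.UTF_8); // 4 bytes: two complete 16-bit units
        byte[] odd = "abc".getBytes(StandardCharsets.UTF_8);   // 3 bytes: one unit plus a dangling byte
        System.out.println(decodesAsUtf16(even)); // true
        System.out.println(decodesAsUtf16(odd));  // false
    }
}
```

UTF-16 consumes input in 16-bit units, so a trailing single byte is what produces "Input length = 1". An even byte count only avoids this particular failure; it does not mean the decoded text is meaningful.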
This is a bug introduced in JDK 17. Prior to this version, this Error-throwing code was only used for the String constructor, which indeed can never encounter a CharacterCodingException because it configures the decoder to substitute illegal content.
E.g., when you use
String s = new String(new byte[] { 50 }, StandardCharsets.UTF_16);
System.out.println(s.chars()
.mapToObj(c -> String.format(" U+%04x", c)).collect(Collectors.joining("", s, "")));
you’ll get
� U+fffd
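The difference between substituting and reporting can also be made visible with the public decoder API. This is a minimal sketch, not the code path the String constructor uses internally; it only applies the same two configurations to the same single-byte input. ReplaceVersusReport and decode are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

class ReplaceVersusReport {
    /** Decodes the bytes as UTF-16, handling malformed or unmappable input with the given action. */
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        return StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bytes = { 50 }; // a single byte cannot form a 16-bit code unit

        // REPLACE: the dangling byte is substituted with U+FFFD.
        String replaced = decode(bytes, CodingErrorAction.REPLACE);
        System.out.println(String.format("U+%04x", (int) replaced.charAt(0)));

        // REPORT: the same input raises a MalformedInputException instead.
        try {
            decode(bytes, CodingErrorAction.REPORT);
        } catch (MalformedInputException e) {
            System.out.println("malformed input of length " + e.getInputLength());
        }
    }
}
```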
In JDK 17, the code was refactored and the duplication removed. Now, the same method, decodeWithDecoder, is used for both the String constructor and Files.readString. But Files.readString is supposed to report encoding errors instead of substituting the problematic content. Therefore, the decoder has intentionally not been configured to substitute malformed content.
When you run
Path p = Files.write(Files.createTempFile("charset", "test"), new byte[] { 50 });
try(Closeable c = () -> Files.delete(p)) {
    String s = Files.readString(p, StandardCharsets.UTF_16);
}
under JDK 16, you’ll correctly get
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.lang.StringCoding.newStringNoRepl1(StringCoding.java:1053)
at java.base/java.lang.StringCoding.newStringNoRepl(StringCoding.java:1003)
at java.base/java.lang.System$2.newStringNoRepl(System.java:2265)
at java.base/java.nio.file.Files.readString(Files.java:3353)
at first.test17.CharsetProblem.main(CharsetProblem.java:23)
The now-removed dedicated routine threw the MalformedInputException encapsulated in an IllegalArgumentException. The immediate caller looks like
/*
 * Throws CCE, instead of replacing, if unmappable.
 */
static byte[] getBytesNoRepl(String s, Charset cs) throws CharacterCodingException {
    try {
        return getBytesNoRepl1(s, cs);
    } catch (IllegalArgumentException e) {
        //getBytesNoRepl1 throws IAE with UnmappableCharacterException or CCE as the cause
        Throwable cause = e.getCause();
        if (cause instanceof UnmappableCharacterException) {
            throw (UnmappableCharacterException)cause;
        }
        throw (CharacterCodingException)cause;
    }
}
and there lies the problem. When the code was refactored to use the same routine for the String constructor and Files.readString, this caller was not adapted. It still expects an IllegalArgumentException where the common method now throws an Error. Alternatively, the common method should have been adapted to suit both cases, e.g. by having a parameter telling whether a CharacterCodingException should be possible or not.
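To illustrate the second option, here is a rough sketch of what such a parameter could look like. This is not JDK code and not the actual fix; decodeInto and the mayFail flag are hypothetical, and the body only mirrors the structure of the decodeWithDecoder method quoted in the question.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeSketch {
    /**
     * Hypothetical variant of the shared routine. When mayFail is true, a coding
     * error is passed on to the caller; when false, it is treated as a "can't happen"
     * condition, which only holds if the decoder was configured to substitute.
     */
    static int decodeInto(CharsetDecoder cd, char[] dst, byte[] src, boolean mayFail)
            throws CharacterCodingException {
        ByteBuffer bb = ByteBuffer.wrap(src);
        CharBuffer cb = CharBuffer.wrap(dst);
        try {
            CoderResult cr = cd.decode(bb, cb, true);
            if (!cr.isUnderflow())
                cr.throwException();
            cr = cd.flush(cb);
            if (!cr.isUnderflow())
                cr.throwException();
        } catch (CharacterCodingException x) {
            if (mayFail)
                throw x;        // reporting caller, e.g. a strict file read
            throw new Error(x); // substituting caller: genuinely unexpected
        }
        return cb.position();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] src = { 50 };
        char[] dst = new char[4];

        // Substituting decoder: the error branch is never reached.
        CharsetDecoder lenient = StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        System.out.println(decodeInto(lenient, dst, src, false));

        // Reporting decoder: the caller gets a regular checked exception.
        try {
            decodeInto(StandardCharsets.UTF_16.newDecoder(), dst, src, true);
        } catch (CharacterCodingException e) {
            System.out.println("reported: " + e);
        }
    }
}
```

Calling the sketch with a reporting decoder and mayFail set to false would reproduce the kind of mismatch described above: a legitimate encoding error surfacing as an Error.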
It’s worth noting that the charset decoding code has a lot of optimizations and shortcuts for commonly used charsets. That’s why you rarely get to this specific method. UTF-16 seems to be one of the rare cases (if not the only one) where this method is used.
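Since catch (Exception e) does not catch an Error, code that has to run on an affected JDK may prefer to avoid this path entirely. One possible workaround, sketched here under the assumption that reading the whole file into memory is acceptable (as it is for Files.readString), is to decode the bytes with a CharsetDecoder directly. StrictRead and readStrict are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class StrictRead {
    /**
     * Reads the whole file and decodes it strictly. Malformed input is reported as a
     * CharacterCodingException, which is a subclass of IOException.
     */
    static String readStrict(Path path, Charset cs) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        // newDecoder() reports malformed and unmappable input by default.
        return cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.write(Files.createTempFile("charset", "test"), new byte[] { 50 });
        try {
            System.out.println(readStrict(p, StandardCharsets.UTF_16));
        } catch (CharacterCodingException e) {
            // An ordinary checked exception, so regular exception handling applies.
            System.out.println("encoding problem: " + e);
        } finally {
            Files.delete(p);
        }
    }
}
```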