
Error which "shouldn't happen" caused by MalformedInputException when reading file to string with UTF-16

Path file = Paths.get("New Text Document.txt");
try {
    System.out.println(Files.readString(file, StandardCharsets.UTF_8));
    System.out.println(Files.readString(file, StandardCharsets.UTF_16));
} catch (Exception e) {
    System.out.println("yep it's an exception");
}

might yield

some text
Exception in thread "main" java.lang.Error: java.nio.charset.MalformedInputException: Input length = 1
    at java.base/java.lang.String.decodeWithDecoder(String.java:1212)
    at java.base/java.lang.String.newStringNoRepl1(String.java:786)
    at java.base/java.lang.String.newStringNoRepl(String.java:738)
    at java.base/java.lang.System$2.newStringNoRepl(System.java:2390)
    at java.base/java.nio.file.Files.readString(Files.java:3369)
    at test.Test2.main(Test2.java:13)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
    at java.base/java.lang.String.decodeWithDecoder(String.java:1205)
    ... 5 more

This error "shouldn't happen". Here's the java.lang.String method:

private static int decodeWithDecoder(CharsetDecoder cd, char[] dst, byte[] src, int offset, int length) {
    ByteBuffer bb = ByteBuffer.wrap(src, offset, length);
    CharBuffer cb = CharBuffer.wrap(dst, 0, dst.length);
    try {
        CoderResult cr = cd.decode(bb, cb, true);
        if (!cr.isUnderflow())
            cr.throwException();
        cr = cd.flush(cb);
        if (!cr.isUnderflow())
            cr.throwException();
    } catch (CharacterCodingException x) {
        // Substitution is always enabled,
        // so this shouldn't happen
        throw new Error(x);
    }
    return cb.position();
}

EDIT: As @user16320675 noted, this happens when a UTF-8 file with an odd number of bytes (for plain ASCII text, an odd number of characters) is read as UTF-16. With an even number, neither the Error nor the MalformedInputException occurs. Why the Error?
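The odd-length observation can be reproduced without a file. The following is a minimal sketch, assuming a plain CharsetDecoder set to report malformed input; the class and method names are made up for illustration and are not part of the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf16OddLength {

    // Decodes the bytes as UTF-16 with a decoder that reports malformed
    // input instead of substituting U+FFFD (hypothetical helper).
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_16.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Returns true if the bytes decode without a CharacterCodingException.
    static boolean decodesCleanly(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] even = "abcd".getBytes(StandardCharsets.UTF_8); // 4 bytes, 2 code units
        byte[] odd = "abc".getBytes(StandardCharsets.UTF_8);   // 3 bytes, 1 byte left over

        System.out.println(decodesCleanly(even)); // true
        System.out.println(decodesCleanly(odd));  // false
    }
}
```

UTF-16 consumes two bytes per code unit, so a trailing single byte cannot be decoded and is reported as malformed input of length 1.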

asked Sep 02 '25 03:09 by Blrp


1 Answer

This is a bug introduced in JDK 17.

Prior to this version, the Error-throwing code was used only by the String constructor, which indeed can never encounter a CharacterCodingException because it configures the decoder to substitute illegal content.

E.g., when you use

String s = new String(new byte[] { 50 }, StandardCharsets.UTF_16);
System.out.println(s.chars()
    .mapToObj(c -> String.format(" U+%04x", c)).collect(Collectors.joining("", s, "")));

you’ll get

� U+fffd

In JDK 17, the code was refactored to remove duplication. The same method, decodeWithDecoder, is now used for both the String constructor and Files.readString. But Files.readString is supposed to report encoding errors instead of substituting the problematic content, so its decoder is intentionally not configured to substitute malformed content.
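The two policies can be put side by side. This is a sketch under assumptions, not the JDK's internal code: the String constructor stands in for the substituting path, and a decoder explicitly set to CodingErrorAction.REPORT stands in for the behavior Files.readString is meant to have. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplaceVsReport {

    // Substituting path: the String constructor replaces malformed input.
    static String substituting(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Reporting path: returns the decoded text, or the simple name of the
    // CharacterCodingException subclass that the decoder threw.
    static String reporting(byte[] bytes) {
        try {
            return StandardCharsets.UTF_16.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] single = { 50 }; // one byte cannot form a UTF-16 code unit

        System.out.println(substituting(single).equals("\uFFFD")); // true
        System.out.println(reporting(single)); // MalformedInputException
    }
}
```

The same input yields a replacement character on one path and a checked exception on the other, which is why a single shared routine has to know which policy its caller asked for.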

When you run

Path p = Files.write(Files.createTempFile("charset", "test"), new byte[] { 50 });
try(Closeable c = () -> Files.delete(p)) {
    String s = Files.readString(p, StandardCharsets.UTF_16);
}

under JDK 16, you’ll correctly get

Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
        at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
        at java.base/java.lang.StringCoding.newStringNoRepl1(StringCoding.java:1053)
        at java.base/java.lang.StringCoding.newStringNoRepl(StringCoding.java:1003)
        at java.base/java.lang.System$2.newStringNoRepl(System.java:2265)
        at java.base/java.nio.file.Files.readString(Files.java:3353)
        at first.test17.CharsetProblem.main(CharsetProblem.java:23)

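The now-removed dedicated routine threw the MalformedInputException wrapped in an IllegalArgumentException, which the immediate caller, newStringNoRepl, unwrapped again. Its encoding counterpart, getBytesNoRepl, follows the same pattern and looks like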

/*
 * Throws CCE, instead of replacing, if unmappable.
 */
static byte[] getBytesNoRepl(String s, Charset cs) throws CharacterCodingException {
    try {
        return getBytesNoRepl1(s, cs);
    } catch (IllegalArgumentException e) {
        //getBytesNoRepl1 throws IAE with UnmappableCharacterException or CCE as the cause
        Throwable cause = e.getCause();
        if (cause instanceof UnmappableCharacterException) {
            throw (UnmappableCharacterException)cause;
        }
        throw (CharacterCodingException)cause;
    }
}

and there lies the problem. When the code was refactored to use the same routine for the String constructor and Files.readString, the caller was not adapted: it still expects an IllegalArgumentException where the common method now throws an Error. Alternatively, the common method could have been adapted to suit both cases, e.g. by taking a parameter that says whether a CharacterCodingException is possible.
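The mismatch can be modeled in a few lines. The sketch below is a simplified stand-in, not the JDK source: sharedDecode and strictDecode are hypothetical names, and the "decoding" is reduced to an odd-length check so that only the exception flow remains.

```java
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;

public class WrapperMismatch {

    // Hypothetical stand-in for the shared routine: it assumes substitution
    // is enabled, so it wraps the checked exception in an Error.
    static String sharedDecode(byte[] src) {
        try {
            if (src.length % 2 != 0) {
                throw new MalformedInputException(1);
            }
            return "ok";
        } catch (CharacterCodingException x) {
            throw new Error(x);
        }
    }

    // Hypothetical stand-in for the caller: it still unwraps an
    // IllegalArgumentException, which the callee no longer throws.
    static String strictDecode(byte[] src) throws CharacterCodingException {
        try {
            return sharedDecode(src);
        } catch (IllegalArgumentException e) {
            throw (CharacterCodingException) e.getCause();
        }
    }

    // Describes what reaches the application code.
    static String outcome(byte[] src) {
        try {
            return strictDecode(src);
        } catch (CharacterCodingException e) {
            return "CharacterCodingException";
        } catch (Error e) {
            return "Error caused by " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(new byte[] { 0, 50 })); // ok
        System.out.println(outcome(new byte[] { 50 }));    // Error caused by MalformedInputException
    }
}
```

Because Error is not a subclass of Exception, the question's catch (Exception e) block never sees it, which matches the uncaught stack trace shown there.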


It’s worth noting that the charset decoding code has many optimizations and shortcuts for commonly used charsets, which is why you rarely reach this specific method. UTF-16 seems to be one of the rare cases (if not the only one) where it is used.

answered Sep 04 '25 23:09 by Holger