
Why does Java's CharsetEncoder define .onMalformedInput()/CharsetDecoder define .onUnmappableCharacter()?

A CharsetDecoder decodes a sequence of bytes into a sequence of chars (see Charset#newDecoder()). Conversely, a CharsetEncoder (see Charset#newEncoder()) does the reverse: it takes a sequence of chars and encodes it into a sequence of bytes.

CharsetDecoder defines .onMalformedInput() and it seems logical (some byte sequence may not translate to a valid char sequence); but why .onUnmappableCharacter() since its input is a byte sequence?

Similarly, CharsetEncoder defines .onUnmappableCharacter() which is, here again, logical (for instance, if your charset is ASCII, you cannot encode ö); but why does it also define .onMalformedInput() since its input is a character sequence?
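For illustration, the "logical" encoder-side case (ö cannot be encoded in ASCII) can be sketched as below. This is a minimal example and not part of the original question: the class name AsciiUnmappableDemo and the helper encodeAscii are hypothetical, and it assumes both error actions are set to CodingErrorAction.REPORT.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only
final class AsciiUnmappableDemo {

    /** Returns the simple name of the exception thrown while encoding, or "none". */
    static String encodeAscii(final String input) {
        final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(input));
            return "none";
        } catch (CharacterCodingException e) {
            // Either MalformedInputException or UnmappableCharacterException
            return e.getClass().getSimpleName();
        }
    }

    public static void main(final String... args) {
        System.out.println("o -> " + encodeAscii("o"));
        // \u00f6 is ö: a legal char, but one US-ASCII has no byte for
        System.out.println("\\u00f6 -> " + encodeAscii("\u00f6"));
    }
}
```

Here the input is a perfectly legal char sequence; the failure comes from the target charset, which is what distinguishes an unmappable character from malformed input.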

This is all the more intriguing since you cannot obtain an encoder from a decoder or vice versa, and the two classes do not seem to share a common ancestor...


EDIT 1

It is indeed possible to trigger .onMalformedInput() on a CharsetEncoder. You "just" have to provide an illegal char or char sequence. The program below relies on the fact that in UTF-16, a high surrogate must be followed by a low surrogate; here, a two-element char array is built from two high surrogates instead, and the program attempts to encode it. NOTE that creating a String from such an ill-formed char sequence throws no exception at all:

Code:

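import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class LargeText
{
    public static void main(final String... args)
        throws CharacterCodingException
    {
        boolean found = false;
        char c = '.';

        // Find the first high surrogate among all 16-bit char values
        for (int i = 0; i < 65536; i++) {
            if (Character.isHighSurrogate((char) i)) {
                c = (char) i;
                found = true;
                break;
            }
        }
        if (!found)
            throw new IllegalStateException();

        System.out.println("found: " + Integer.toHexString(c));
        // Two high surrogates in a row: ill-formed UTF-16
        final char[] foo = { c, c };

        new String(foo); // <-- DOES NOT THROW AN EXCEPTION!!!

        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT);

        // Throws MalformedInputException
        encoder.encode(CharBuffer.wrap(foo));
    }
}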

Output:

found: d800
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:798)
    at com.github.fge.largetext.LargeText.main(LargeText.java:166)

EDIT 2

But now, how about the reverse? From @Kairos's answer below, quoting the Javadoc:

UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT

Now, what is "cannot be mapped to an equivalent character sequence"?

I play quite a bit with CharsetDecoders in this project and have yet to produce such an error. I know how to reproduce an error in which, for instance, you only have two bytes out of a three-byte UTF-8 sequence but this triggers a MalformedInputException. All you have to do in this case is restart the decoding from the last known position of the ByteBuffer.
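The truncated-sequence case described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the project mentioned: the class name TruncatedUtf8Demo and the helper describe are hypothetical, and the input is the single byte 'a' followed by the first two bytes of the three-byte sequence E2 82 AC. It uses the three-argument decode(), which returns a CoderResult instead of throwing, so the position of the ByteBuffer can be inspected afterwards.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only
final class TruncatedUtf8Demo {

    // 'a', then only two bytes out of the three-byte sequence E2 82 AC
    static final byte[] TRUNCATED = { 'a', (byte) 0xE2, (byte) 0x82 };

    /** Decodes TRUNCATED as the final input and describes where the decoder stopped. */
    static String describe() {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        final ByteBuffer in = ByteBuffer.wrap(TRUNCATED);
        final CharBuffer out = CharBuffer.allocate(8);

        // endOfInput = true: tell the decoder that no more bytes will follow
        final CoderResult result = decoder.decode(in, out, true);
        out.flip();

        return "malformed=" + result.isMalformed()
            + " unmappable=" + result.isUnmappable()
            + " decoded=" + out
            + " position=" + in.position();
    }

    public static void main(final String... args) {
        System.out.println(describe());
    }
}
```

The result is reported as malformed rather than unmappable, and the buffer position is left at the start of the incomplete sequence, which is the "last known position" decoding can be restarted from once more bytes are available.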

Triggering an UnmappableCharacterException would basically mean that the character encoding itself would allow for an illegal char to be generated; or an illegal Unicode code point.

Is this possible at all?

asked Apr 05 '14 20:04 by fge


1 Answer

Per the docs for CharsetEncoder.encode(), it throws a MalformedInputException:

If the character sequence starting at the input buffer's current position is not a legal sixteen-bit Unicode sequence and the current malformed-input action is CodingErrorAction.REPORT

So, you are given the option of providing a CodingErrorAction by utilizing onMalformedInput so that if you encounter one of these illegal sixteen-bit Unicode sequences, the provided action will be performed.
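As a rough sketch of what choosing an action means in practice, the example below encodes the same ill-formed input (two high surrogates) under each CodingErrorAction. The class name MalformedActionDemo and the helper encodedLength are hypothetical and only serve the illustration; the -1 return value is a convention of this sketch, not of the API.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only
final class MalformedActionDemo {

    // Two high surrogates in a row: not a legal sixteen-bit Unicode sequence
    static final char[] ILL_FORMED = { '\uD800', '\uD800' };

    /**
     * Returns the number of bytes produced when encoding ILL_FORMED to UTF-8,
     * or -1 (a convention of this sketch) if MalformedInputException is thrown.
     */
    static int encodedLength(final CodingErrorAction action) {
        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(action);
        try {
            return encoder.encode(CharBuffer.wrap(ILL_FORMED)).remaining();
        } catch (MalformedInputException e) {
            return -1;
        } catch (CharacterCodingException e) {
            // Not expected for this input; surfaced rather than swallowed
            throw new IllegalStateException(e);
        }
    }

    public static void main(final String... args) {
        System.out.println("REPORT:  " + encodedLength(CodingErrorAction.REPORT));
        System.out.println("IGNORE:  " + encodedLength(CodingErrorAction.IGNORE));
        System.out.println("REPLACE: " + encodedLength(CodingErrorAction.REPLACE));
    }
}
```

With REPORT the malformed input surfaces as an exception, with IGNORE it is dropped, and with REPLACE the encoder substitutes its replacement bytes (see CharsetEncoder.replacement()).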

Similarly, for CharsetDecoder.decode():

UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT

answered Nov 06 '22 04:11 by Reuben Tanner