A CharsetDecoder basically decodes a sequence of bytes into a sequence of chars (see Charset#newDecoder()). On the opposite side, a CharsetEncoder (see Charset#newEncoder()) does the reverse: it takes a sequence of chars and encodes them into a sequence of bytes.
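As a minimal, self-contained sketch of that round trip (the class name is made up for illustration; everything else is the standard java.nio.charset API):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public final class RoundTrip
{
    public static void main(final String... args)
        throws CharacterCodingException
    {
        final Charset charset = StandardCharsets.UTF_8;
        final CharsetEncoder encoder = charset.newEncoder();
        final CharsetDecoder decoder = charset.newDecoder();

        // chars -> bytes
        final ByteBuffer bytes = encoder.encode(CharBuffer.wrap("hello"));
        // bytes -> chars
        final CharBuffer chars = decoder.decode(bytes);

        System.out.println(chars); // prints "hello"
    }
}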
CharsetDecoder defines .onMalformedInput() and it seems logical (some byte sequence may not translate to a valid char sequence); but why .onUnmappableCharacter(), since its input is a byte sequence?
Similarly, CharsetEncoder defines .onUnmappableCharacter() which is, here again, logical (for instance, if your charset is ASCII, you cannot encode ö); but why does it also define .onMalformedInput(), since its input is a character sequence?
This is all the more intriguing since you cannot obtain an encoder from a decoder or vice versa, and the two classes do not seem to share a common ancestor...
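For the record, here is a minimal, self-contained illustration of the encoder's unmappable case (the class name is made up; everything else is standard java.nio): US-ASCII has no byte for ö, so with the unmappable-character action set to REPORT, encoding it throws UnmappableCharacterException:

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class AsciiUnmappable
{
    public static void main(final String... args)
        throws CharacterCodingException
    {
        final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        // "ö" is a perfectly valid char sequence, but US-ASCII has no byte
        // for it: this throws UnmappableCharacterException
        encoder.encode(CharBuffer.wrap("ö"));
    }
}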
EDIT 1
It is indeed possible to trigger .onMalformedInput() on a CharsetEncoder. You "just" have to provide an illegal char or char sequence. The program below relies on the fact that in UTF-16, a high surrogate must be followed by a low surrogate; here, a two-element char array is built with two high surrogates instead, and an attempt is made to encode it. NOTE how the creation of a String from such an ill-formed char sequence throws no exception at all:
Code:
package com.github.fge.largetext;

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class LargeText
{
    public static void main(final String... args)
        throws CharacterCodingException
    {
        // find the first char which is a high surrogate
        boolean found = false;
        char c = '.';
        for (int i = 0; i < 65536; i++) {
            if (Character.isHighSurrogate((char) i)) {
                c = (char) i;
                found = true;
                break;
            }
        }
        if (!found)
            throw new IllegalStateException();
        System.out.println("found: " + Integer.toHexString(c));

        // two high surrogates in a row: not a legal UTF-16 sequence
        final char[] foo = { c, c };
        new String(foo); // <-- DOES NOT THROW AN EXCEPTION!!!

        // ... but the encoder does complain
        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT);
        encoder.encode(CharBuffer.wrap(foo));
    }
}
Output:
found: d800
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:798)
at com.github.fge.largetext.LargeText.main(LargeText.java:166)
EDIT 2
But now, how about the reverse? From @Kairos's answer below, quoting the javadoc:
UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT
Now, what is "cannot be mapped to an equivalent character sequence"?
I play quite a bit with CharsetDecoders in this project and have yet to produce such an error. I know how to reproduce an error in which, for instance, you only have two bytes out of a three-byte UTF-8 sequence, but this triggers a MalformedInputException. All you have to do in this case is restart the decoding from the last known position of the ByteBuffer.
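To make that concrete, here is a small self-contained sketch (not this project's actual code; the class name is made up) decoding only the first two bytes of the three-byte UTF-8 sequence for '€' (E2 82 AC): the convenience decode() treats the input as complete and reports a MalformedInputException, whereas the three-argument decode() with endOfInput set to false returns UNDERFLOW and leaves the buffer positioned at the incomplete sequence, so decoding can restart from there once more bytes arrive:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public final class TruncatedSequence
{
    public static void main(final String... args)
        throws Exception
    {
        // first two bytes of the three-byte UTF-8 sequence for '€' (E2 82 AC)
        final byte[] partial = { (byte) 0xe2, (byte) 0x82 };

        // convenience method: the input is treated as complete, so the
        // truncated sequence is reported as malformed input
        try {
            StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(partial));
        } catch (MalformedInputException e) {
            System.out.println("convenience decode(): " + e);
        }

        // low-level method with endOfInput = false: the decoder just waits
        // for more input and leaves the buffer positioned at the incomplete
        // sequence, so decoding can resume from there
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer in = ByteBuffer.wrap(partial);
        final CharBuffer out = CharBuffer.allocate(16);
        final CoderResult result = decoder.decode(in, out, false);
        System.out.println(result + ", position left at " + in.position());
    }
}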
Triggering an UnmappableCharacterException would basically mean that the character encoding itself would allow for an illegal char, or an illegal Unicode code point, to be generated.
Is this possible at all?
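One way to probe this empirically (a brute-force sketch, not a definitive answer; the class name is made up and only single-byte inputs are tried) is to decode every possible byte with every installed charset, with both error actions set to REPORT, and see whether an UnmappableCharacterException ever shows up:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.UnmappableCharacterException;

public final class UnmappableProbe
{
    public static void main(final String... args)
    {
        for (final Charset charset: Charset.availableCharsets().values()) {
            final CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            for (int i = 0; i < 256; i++) {
                final ByteBuffer buf = ByteBuffer.wrap(new byte[] { (byte) i });
                try {
                    decoder.decode(buf);
                } catch (UnmappableCharacterException e) {
                    // well-formed for this charset, but no equivalent character
                    System.out.printf("%s: byte 0x%02x is unmappable%n",
                        charset.name(), i);
                } catch (MalformedInputException e) {
                    // illegal or incomplete sequence: not what we are after
                } catch (CharacterCodingException e) {
                    // any other coding error
                }
            }
        }
    }
}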
Per the docs for CharsetEncoder.encode(), it throws a MalformedInputException:
If the character sequence starting at the input buffer's current position is not a legal sixteen-bit Unicode sequence and the current malformed-input action is CodingErrorAction.REPORT
So you are given the option of providing a CodingErrorAction via onMalformedInput: if the encoder encounters one of these illegal sixteen-bit Unicode sequences, the provided action is performed.
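For example, here is a short sketch (the class name is made up; the exact replacement bytes are an implementation detail of the JDK's UTF-8 encoder) showing how the three CodingErrorAction values differ on the same unpaired-surrogate input:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public final class ActionsDemo
{
    public static void main(final String... args)
        throws CharacterCodingException
    {
        final char[] unpaired = { '\ud800' }; // lone high surrogate

        // REPORT: the error is thrown back at the caller
        try {
            StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(unpaired));
        } catch (MalformedInputException e) {
            System.out.println("REPORT: " + e);
        }

        // REPLACE: the malformed input is substituted with the encoder's
        // replacement bytes (typically a single '?' for the JDK)
        final ByteBuffer replaced = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .encode(CharBuffer.wrap(unpaired));
        System.out.println("REPLACE: " + replaced.remaining() + " byte(s)");

        // IGNORE: the malformed input is silently dropped
        final ByteBuffer ignored = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.IGNORE)
            .encode(CharBuffer.wrap(unpaired));
        System.out.println("IGNORE: " + ignored.remaining() + " byte(s)");
    }
}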
Similarly, for CharsetDecoder.decode():
UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT