I have a problem with the CharsetDecoder class.
First example of code (which works):
final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
final ByteBuffer b = ByteBuffer.allocate(3);
final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
for (int i=0; i<tab.length; i++){
b.put(tab, i, 1);
}
try {
b.flip();
System.out.println("a" + dec.decode(b).toString() + "a");
} catch (CharacterCodingException e1) {
e1.printStackTrace();
}
The result is a€a
But when I execute this code:
final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
final CharBuffer chars = CharBuffer.allocate(3);
final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
for (int i=0; i<tab.length; i++){
ByteBuffer buffer = ByteBuffer.wrap(tab, i, 1);
dec.decode(buffer, chars, i == 2);
}
dec.flush(chars);
System.out.println("a" + chars.toString() + "a");
The result is a
Why is the result not the same?
How do I use the method decode(ByteBuffer, CharBuffer, endOfInput) of class CharsetDecoder to get the result a€a?
-- EDIT --
Based on Jesper's code, I came up with the following. It's not perfect, but it works with a step of 1, 2, or 3:
final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
final CharBuffer chars = CharBuffer.allocate(6);
final byte[] tab = new byte[]{(byte)97, (byte)-30, (byte)-126, (byte)-84, (byte)97, (byte)97}; //char €
final ByteBuffer buffer = ByteBuffer.allocate(10);
final int step = 3;
for (int i = 0; i < tab.length; i++) {
// Add the next byte to the buffer
buffer.put(tab, i, step);
i+=step-1;
// Remember the current position
final int pos = buffer.position();
int l=chars.position();
// Try to decode
buffer.flip();
final CoderResult result = dec.decode(buffer, chars, i >= tab.length -1);
System.out.println(result);
if (result.isUnderflow() && chars.position() == l) {
// Underflow, prepare the buffer for more writing
buffer.position(pos);
}else{
if (buffer.position() == buffer.limit()){
//ByteBuffer decoded
buffer.clear();
buffer.position(0);
}else{
//a part of ByteBuffer is decoded. We keep only bytes which are not decoded
final byte[] b = buffer.array();
final int f = buffer.position();
final int g = buffer.limit() - buffer.position();
buffer.clear();
buffer.position(0);
buffer.put(b, f, g);
}
}
buffer.limit(buffer.capacity());
}
dec.flush(chars);
chars.flip();
System.out.println(chars.toString());
The method decode(ByteBuffer, CharBuffer, boolean) returns a result, but you are ignoring it. If you print the result in your second code fragment:
for (int i = 0; i < tab.length; i++) {
ByteBuffer buffer = ByteBuffer.wrap(tab, i, 1);
System.out.println(dec.decode(buffer, chars, i == 2));
}
you'll see this output:
UNDERFLOW
MALFORMED[1]
MALFORMED[1]
a a
Apparently it does not work correctly if you start decoding in the middle of a character. The decoder expects that the first thing it reads is the start of a valid UTF-8 sequence.
Edit: When the decoder reports UNDERFLOW, it expects you to add more data to the input buffer and then call decode() again, but you must re-offer it the data from the start of the UTF-8 sequence that you are trying to decode. You can't continue in the middle of a UTF-8 sequence.
Here is a version that works, adding one byte from tab in every iteration of the loop:
final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
final CharBuffer chars = CharBuffer.allocate(3);
final byte[] tab = new byte[]{(byte) -30, (byte) -126, (byte) -84}; //char €
final ByteBuffer buffer = ByteBuffer.allocate(10);
for (int i = 0; i < tab.length; i++) {
// Add the next byte to the buffer
buffer.put(tab[i]);
// Remember the current position
final int pos = buffer.position();
// Try to decode
buffer.flip();
final CoderResult result = dec.decode(buffer, chars, i == 2);
System.out.println(result);
if (result.isUnderflow()) {
// Underflow, prepare the buffer for more writing
buffer.limit(buffer.capacity());
buffer.position(pos);
}
}
dec.flush(chars);
chars.flip();
System.out.println("a" + chars.toString() + "a");
The decoder does not internally cache the data from partial characters, but this does not mean that you have to do complicated things to figure out what data to re-feed the decoder. You gave it a clear way to represent what data it actually consumed, i.e. the input ByteBuffer and its position. In the second example, by giving it a new ByteBuffer every time, the OP failed to pass the decoder back what it reported it had not yet consumed.
The standard pattern for using NIO Buffers is input, flip, output, compact, loop. Short of optimization (which may be premature), there is no reason to re-implement compact yourself. You might just get it wrong, like @Jesper and @lecogiteur did (if more than a single character was ever presented). You should NOT be resetting to the position from before the decode call.
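To make the compact step concrete, here is a minimal sketch of what compact() does to a partially read buffer. The class and method names (CompactSketch, compactAfterReading) are made up for illustration; only the ByteBuffer calls are real API.

```java
import java.nio.ByteBuffer;

class CompactSketch {

    // Hypothetical helper: fills a buffer, reads `consumed` bytes, then compacts.
    // Returns {position, limit, byte at index 0} as seen after compact().
    // Assumes data.length <= 8 and consumed <= data.length.
    static int[] compactAfterReading(byte[] data, int consumed) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.put(data);                 // input: write into the buffer
        buf.flip();                    // flip: switch from writing to reading
        for (int i = 0; i < consumed; i++) {
            buf.get();                 // output: consume some of the bytes
        }
        buf.compact();                 // compact: move unread bytes to index 0,
                                       // position = number of unread bytes,
                                       // limit = capacity (ready for more input)
        return new int[] { buf.position(), buf.limit(), buf.get(0) };
    }

    public static void main(String[] args) {
        int[] state = compactAfterReading(new byte[] {1, 2, 3, 4, 5}, 3);
        System.out.println("position=" + state[0] + " limit=" + state[1] + " first=" + state[2]);
    }
}
```

After compact() the unread bytes sit at the front of the buffer and the position points just past them, so the next put() appends new input behind whatever the consumer left over.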
The second example should have read something like:
final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
final CharBuffer chars = CharBuffer.allocate(3);
final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
final ByteBuffer buffer = ByteBuffer.wrap(new byte[3]);
for (int i=0; i<tab.length; i++){
buffer.put(tab, i, 1); // In actual usage some type of IO read/transfer would occur here
buffer.flip();
dec.decode(buffer, chars, i == 2);
buffer.compact();
}
dec.flush(chars);
chars.flip(); // switch to reading so toString() returns the decoded characters
System.out.println("a" + chars.toString() + "a");
NOTE: The above does not check the return value to detect malformed input, nor does it include the other error handling needed to run safely on arbitrary input/IO conditions.
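For completeness, here is a rough sketch of the same put/flip/decode/compact loop that also checks the CoderResult and drains the CharBuffer on overflow. The class name, the decodeInChunks helper, the buffer sizes, and the chunking scheme are assumptions made for illustration, not part of the original answer; real code would read from a channel or stream instead of a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedDecodeSketch {

    // Hypothetical helper: feeds `input` to a UTF-8 decoder `chunkSize` bytes at a time.
    static String decodeInChunks(byte[] input, int chunkSize) throws CharacterCodingException {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // Room for one chunk plus the few bytes of an incomplete sequence left by compact().
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize + 8);
        CharBuffer chars = CharBuffer.allocate(16);
        StringBuilder out = new StringBuilder();

        int offset = 0;
        boolean endOfInput;
        do {
            // input: stands in for an IO read
            int n = Math.min(chunkSize, input.length - offset);
            bytes.put(input, offset, n);
            offset += n;
            endOfInput = (offset == input.length);

            bytes.flip();
            CoderResult result;
            do {
                result = dec.decode(bytes, chars, endOfInput);
                if (result.isError()) {
                    result.throwException(); // malformed or unmappable input
                }
                drain(chars, out);           // also frees space after an OVERFLOW
            } while (result.isOverflow());
            bytes.compact();                 // keep undecoded bytes for the next round
        } while (!endOfInput);

        CoderResult flushed;
        do {
            flushed = dec.flush(chars);
            drain(chars, out);
        } while (flushed.isOverflow());
        return out.toString();
    }

    // Moves everything decoded so far from the CharBuffer into the StringBuilder.
    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] tab = {97, -30, -126, -84, 97, 97}; // 'a', the three bytes of €, 'a', 'a'
        for (int step = 1; step <= 3; step++) {
            System.out.println(decodeInChunks(tab, step));
        }
    }
}
```

Because the decoder only advances the input position past bytes it has actually consumed, compact() is enough to carry an incomplete sequence over to the next iteration, whatever the chunk size.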