
What is CharsetDecoder.decode(ByteBuffer, CharBuffer, endOfInput)

I have a problem with the CharsetDecoder class.

First example of code (which works):

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final ByteBuffer b = ByteBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
    for (int i=0; i<tab.length; i++){
        b.put(tab, i, 1);
    }
    try {
        b.flip();
        System.out.println("a" + dec.decode(b).toString() + "a");
    } catch (CharacterCodingException e1) {
        e1.printStackTrace();
    }

The result is a€a

But when I execute this code:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
    for (int i=0; i<tab.length; i++){
        ByteBuffer buffer = ByteBuffer.wrap(tab, i, 1);
        dec.decode(buffer, chars, i == 2);
    }
    dec.flush(chars);
    System.out.println("a" + chars.toString() + "a");

The result is a

Why is the result not the same?

How do I use the decode(ByteBuffer, CharBuffer, endOfInput) method of CharsetDecoder to get the result a€a?

-- EDIT --

Based on Jesper's code, I came up with the following. It's not perfect, but it works with step = 1, 2 and 3:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(6);
    final byte[] tab = new byte[]{(byte)97, (byte)-30, (byte)-126, (byte)-84, (byte)97, (byte)97}; //char €

    final ByteBuffer buffer = ByteBuffer.allocate(10);

    final int step = 3;
    for (int i = 0; i < tab.length; i++) {
        // Add the next byte to the buffer
        buffer.put(tab, i, step);
        i+=step-1;

        // Remember the current position
        final int pos = buffer.position();
        int l=chars.position();

        // Try to decode
        buffer.flip();
        final CoderResult result = dec.decode(buffer, chars, i >= tab.length -1);
        System.out.println(result);

        if (result.isUnderflow() && chars.position() == l) {
            // Underflow, prepare the buffer for more writing
            buffer.position(pos);
        }else{
            if (buffer.position() == buffer.limit()){
                //ByteBuffer decoded
                buffer.clear();
                buffer.position(0);
            }else{
                //a part of ByteBuffer is decoded. We keep only bytes which are not decoded
                final byte[] b = buffer.array();
                final int f = buffer.position();
                final int g = buffer.limit() - buffer.position();
                buffer.clear();
                buffer.position(0);
                buffer.put(b, f, g);
            }
        }
        buffer.limit(buffer.capacity());
    }

    dec.flush(chars);
    chars.flip();

    System.out.println(chars.toString());

lecogiteur asked Apr 10 '15 11:04


2 Answers

The method decode(ByteBuffer, CharBuffer, boolean) returns a result, but you are ignoring it. If you print the result in your second code fragment:

    for (int i = 0; i < tab.length; i++) {
        ByteBuffer buffer = ByteBuffer.wrap(tab, i, 1);
        System.out.println(dec.decode(buffer, chars, i == 2));
    }

you'll see this output:

UNDERFLOW
MALFORMED[1]
MALFORMED[1]
a   a

Apparently it does not work correctly if you start decoding in the middle of a character. The decoder expects that the first thing it reads is the start of a valid UTF-8 sequence.

edit - When the decoder reports UNDERFLOW, it expects you to add more data to the input buffer and then call decode() again, but you must re-offer it the data from the start of the UTF-8 sequence that you are trying to decode. You can't continue in the middle of a UTF-8 sequence.

Here is a version that works, adding one byte from tab in every iteration of the loop:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte) -30, (byte) -126, (byte) -84}; //char €

    final ByteBuffer buffer = ByteBuffer.allocate(10);

    for (int i = 0; i < tab.length; i++) {
        // Add the next byte to the buffer
        buffer.put(tab[i]);

        // Remember the current position
        final int pos = buffer.position();

        // Try to decode
        buffer.flip();
        final CoderResult result = dec.decode(buffer, chars, i == 2);
        System.out.println(result);

        if (result.isUnderflow()) {
            // Underflow, prepare the buffer for more writing
            buffer.limit(buffer.capacity());
            buffer.position(pos);
        }
    }

    dec.flush(chars);
    chars.flip();

    System.out.println("a" + chars.toString() + "a");

Jesper answered Nov 03 '22 18:11


The decoder does not internally cache the bytes of a partial character, but this does not mean that you have to do complicated things to figure out what data to re-feed it. It already has a clear way to report what data it actually consumed: the input ByteBuffer and its position. In the second example, by giving it a new ByteBuffer every time, the OP failed to pass back to the decoder the bytes it reported it had not yet consumed.
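To make that concrete, here is a small sketch that offers the decoder only the first one, two, or three bytes of the € sequence and prints how many bytes it leaves unread. The class and method names are made up for illustration. The Javadoc only says that leftover bytes must be preserved for the next call; the exact counts printed here are what I would expect from OpenJDK's UTF-8 decoder, not something the API promises.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class UnconsumedBytesDemo {

    // Hypothetical helper: offers only the first `count` bytes of the euro sign
    // and returns how many of them the decoder left unread in the input buffer.
    static int leftoverAfterPartialDecode(int count) {
        byte[] euro = {(byte) -30, (byte) -126, (byte) -84}; // UTF-8 bytes of €
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(euro, 0, count);
        CharBuffer out = CharBuffer.allocate(4);

        // endOfInput = false: tell the decoder that more bytes may follow
        CoderResult result = dec.decode(in, out, false);

        System.out.println(count + " byte(s) offered: " + result
                + ", unread bytes: " + in.remaining()
                + ", chars written: " + out.position());
        return in.remaining();
    }

    public static void main(String[] args) {
        for (int count = 1; count <= 3; count++) {
            leftoverAfterPartialDecode(count);
        }
    }
}
```

Whatever is still unread after decode() is exactly what has to be in front of the next chunk, which is the job compact() does.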

The standard pattern for using NIO Buffers is input, flip, output, compact, loop. Short of optimization (which may be premature), there is no reason to re-implement compact yourself. You might just get it wrong, like @Jesper and @lecogiteur did (if more than a single character was ever presented). You should NOT be resetting to the position from before the decode call.

The second example should have read something like:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
    final ByteBuffer buffer = ByteBuffer.wrap(new byte[3]);

    for (int i=0; i<tab.length; i++){
        buffer.put(tab, i, 1);  // In actual usage some type of IO read/transfer would occur here
        buffer.flip();          // switch the buffer from writing to reading
        dec.decode(buffer, chars, i == 2);
        buffer.compact();       // keep any unconsumed bytes at the front for the next pass
    }
    dec.flush(chars);
    chars.flip();               // switch chars to reading so toString() returns the decoded text
    System.out.println("a" + chars.toString() + "a");

NOTE: The above does not check the return value of decode(), so it does not detect malformed input, and it has none of the other error handling needed to run safely on arbitrary input/IO conditions.
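For what it's worth, here is a rough sketch of the same put, flip, decode, compact loop with the return value checked. ChunkedDecode, decodeInChunks and drain are names invented for this example, the buffer sizes are arbitrary, and throwing IllegalArgumentException on malformed input is just one possible choice. It assumes UTF-8, where at most 3 bytes of an incomplete sequence can be carried over between chunks.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedDecode {

    // Hypothetical helper (not part of the JDK): decodes UTF-8 `input`,
    // offering it to the decoder `step` bytes at a time.
    static String decodeInChunks(byte[] input, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be at least 1");
        }
        if (input.length == 0) {
            return ""; // nothing to decode, so decode() and flush() are never needed
        }
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // Room for one chunk plus up to 3 carried-over bytes of a partial sequence
        ByteBuffer buffer = ByteBuffer.allocate(step + 3);
        CharBuffer chars = CharBuffer.allocate(16); // arbitrary size for this sketch
        StringBuilder out = new StringBuilder();

        for (int offset = 0; offset < input.length; offset += step) {
            int n = Math.min(step, input.length - offset);
            boolean endOfInput = offset + n == input.length;

            buffer.put(input, offset, n); // input (real code would read from IO here)
            buffer.flip();                // flip: switch the buffer to reading
            CoderResult result;
            do {                          // output: decode until the chunk is used up
                result = dec.decode(buffer, chars, endOfInput);
                if (result.isError()) {
                    throw new IllegalArgumentException("Invalid UTF-8 input: " + result);
                }
                drain(chars, out);
            } while (result.isOverflow()); // OVERFLOW: chars was full, go around again
            buffer.compact();             // compact: keep unread bytes for the next chunk
        }

        CoderResult flushed;
        do {
            flushed = dec.flush(chars);
            drain(chars, out);
        } while (flushed.isOverflow());
        return out.toString();
    }

    // Moves whatever has been decoded so far into `out` and empties `chars`.
    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    public static void main(String[] args) {
        byte[] tab = {97, -30, -126, -84, 97, 97}; // "a€aa" in UTF-8
        for (int step = 1; step <= 3; step++) {
            System.out.println("step " + step + ": " + decodeInChunks(tab, step));
        }
    }
}
```

The chunk boundaries may fall anywhere inside the € sequence; because compact() carries the unread bytes forward, the decoded text should be the same for every step value.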

SensorSmith answered Nov 03 '22 20:11