Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

InputStreamReader buffering issue

I am reading data from a file that has, unfortunately, two types of character encoding.

There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content/length.

The file may also be quite large so I need to avoid bring the entire content into memory.

So I started off with a single InputStream. I wrap it initially with an InputStreamReader with ASCII and decode the header and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream and start trying to read the body.

Unfortunately it appears, javadoc confirms this, that InputStreamReader may choose to read-ahead for effeciency purposes. So the reading of the header chews some/all of the body.

Does anyone have any suggestions for working round this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time but a good idea (possibly wrapped in a custom Reader implementation?)

Thanks in advance.

EDIT: My final solution was to write a InputStreamReader that has no buffering to ensure I can parse the header without chewing part of the body. Although this is not terribly efficient I wrap the raw InputStream with a BufferedInputStream so it won't be an issue.

// An InputStreamReader that only consumes as many bytes as is necessary
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 1 );

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
    }

    @Override
    public int read() throws IOException
    {
        boolean middleOfReading = false;

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                if ( middleOfReading )
                    throw new IOException( "Unexpected end of stream, byte truncated" );

                return -1;
            }

            byteBuffer.clear();
            byteBuffer.put( (byte)b );
            byteBuffer.flip();

            CharBuffer charBuffer = charsetDecoder.decode( byteBuffer );

            // although this is theoretically possible this would violate the unbuffered nature
            // of this class so we throw an exception
            if ( charBuffer.length() > 1 )
                throw new IOException( "Decoded multiple characters from one byte!" );

            if ( charBuffer.length() == 1 )
                return charBuffer.get();

            middleOfReading = true;
        }
    }

    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            cbuf[ i ] = (char)ch;
        }

        return len;
    }

    public void close() throws IOException
    {
        inputStream.close();
    }
}
like image 408
Mike Q Avatar asked Apr 13 '10 16:04

Mike Q


People also ask

Do I need to close InputStreamReader?

It's important to close any resource that you use. in. close will close BufferedReader, which in turn closes the resources that it itself uses ie. the InputStreamReader.

Does InputStreamReader close underlying stream?

InputStream is just an "interface" in terms of close() . InputStreamReader will not close an interface. It will close the underlying data resource (like file descriptor) if it is. It will do nothing if close is override and empty in an implementation.

What is the function of InputStreamReader?

An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset . The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.

What is the difference between InputStream and InputStreamReader?

An InputStream is typically always connected to some data source, like a file, network connection, pipe etc. This is also explained in more detail in the Java IO Overview text. InputStreamReader takes an inputstream and converts the bytes Strem into characters when you are reading it.


2 Answers

Why don't you use 2 InputStreams? One for reading the header and another for the body.

The second InputStream should skip the header bytes.

like image 194
bruno conde Avatar answered Sep 21 '22 00:09

bruno conde


Here is the pseudo code.

  1. Use InputStream, but do not wrap a Reader around it.
  2. Read bytes containing header and store them into ByteArrayOutputStream.
  3. Create ByteArrayInputStream from ByteArrayOutputStream and decode header, this time wrap ByteArrayInputStream into Reader with ASCII charset.
  4. Compute the length of non-ascii input, and read that number of bytes into another ByteArrayOutputStream.
  5. Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it with Reader with charset from the header.
like image 24
Alexander Pogrebnyak Avatar answered Sep 20 '22 00:09

Alexander Pogrebnyak