
Stream decoding of Base64 data

I have some large Base64-encoded data (stored in Snappy files in the Hadoop filesystem). The data was originally gzipped text. I need to read chunks of the encoded data, decode them, and flush the result to a GZIPOutputStream.

Any ideas on how I could do this without loading the whole Base64 data into an array and calling Base64.decodeBase64(byte[])?

Would it be correct to read the characters up to the '\r\n' delimiter and decode the data line by line? For example:

for (int i = 0; i < byteData.length; i++) {
    if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
       if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
            i += 2;
       else 
            i += 1;

       byteBuffer.put(Base64.decodeBase64(record));

       byteCounter = 0;
       record = new byte[8192];
    } else {
        record[byteCounter++] = byteData[i];
    }
}

Sadly, this approach doesn't give any human readable output. Ideally, I would like to stream read, decode, and stream out the data.

Right now, I'm trying to wrap the decoded bytes in an InputStream and then copy it to the GZIPOutputStream:

byteBuffer.get(bufferBytes);

InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);

This gives me a java.io.IOException: Corrupt GZIP trailer.

James Isaac asked Nov 14 '13 14:11


1 Answer

Let's go step by step:

  1. You need a GZIPInputStream to read gzipped data, not a GZIPOutputStream (the output stream is for compressing data). With this stream you can read the uncompressed, original data. Its constructor requires an InputStream.

  2. You need an input stream capable of reading the Base64-encoded data. I suggest the handy Base64InputStream from Apache Commons Codec. Its constructor lets you set the line length and the line separator, and passing doEncode=false makes it decode. This in turn requires another input stream: the raw, Base64-encoded data.

  3. This stream depends on how you get your data. Ideally the data is already available as an InputStream, and the problem is solved. If not, you may have to use a ByteArrayInputStream (if you have a byte array, or the bytes of a String; StringBufferInputStream also exists for strings but is deprecated), etc.

Roughly this logic is:

InputStream fromHadoop = ...;                                  // step 3: raw Base64 text
Base64InputStream b64is =                                      // step 2: Base64 decoding
    new Base64InputStream(fromHadoop, false, 80, "\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is);              // step 1: gunzip

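Please pay attention to the arguments of Base64InputStream (line length and end-of-line byte array). According to the Commons Codec Javadoc they only matter when encoding and are ignored when doEncode is false, so the single-argument constructor new Base64InputStream(fromHadoop) should decode the same way.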
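To make the chain above concrete, here is a minimal, self-contained sketch of the same idea. It is not the exact code from this answer: it swaps Commons Codec's Base64InputStream for the JDK's own java.util.Base64.getMimeDecoder().wrap(...) (Java 8+) so that it runs without extra dependencies. The class name StreamingBase64Gunzip and the helper gzipThenBase64 are made up for illustration, and the in-memory input stands in for whatever stream you get from Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamingBase64Gunzip {

    /**
     * Reads Base64 text from {@code encoded}, decodes and gunzips it on the fly,
     * and writes the plain bytes to {@code out}. Returns the number of bytes written.
     */
    static long decodeAndGunzip(InputStream encoded, OutputStream out) throws IOException {
        // The MIME decoder skips line separators and other non-Base64 characters,
        // so the input does not have to be split into lines by hand.
        try (InputStream b64 = Base64.getMimeDecoder().wrap(encoded);
             InputStream gunzipped = new GZIPInputStream(b64)) {
            byte[] chunk = new byte[8192];
            long total = 0;
            int n;
            while ((n = gunzipped.read(chunk)) != -1) {
                out.write(chunk, 0, n);
                total += n;
            }
            out.flush();
            return total;
        }
    }

    /** Hypothetical demo helper: gzip the text, then Base64-encode it with MIME line breaks. */
    static byte[] gzipThenBase64(String text) throws IOException {
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(gz)) {
            g.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getMimeEncoder().encode(gz.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = gzipThenBase64("some text that was gzipped and then Base64-encoded\n");
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        long written = decodeAndGunzip(new ByteArrayInputStream(encoded), plain);
        System.out.println(written + " bytes: " + plain.toString("UTF-8"));
    }
}
```

Only one 8 KB chunk is held in memory at a time. If the goal is to store the data gzipped rather than as plain text, note that the Base64-decoded bytes are already a gzip stream, so they could probably be copied to the destination directly, without the GZIPInputStream and without re-compressing.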

Nikos Paraskevopoulos answered Oct 04 '22 15:10