
GZIP in Matlab for big files

I have a function that unpacks a byte array Z that was packaged using the zlib library (adapted from here).

  • The packed data size is 4.11 GB, and the unpacked data will be 6.65 GB. I have 32 GB of memory, so this is well below the limit.
  • I tried increasing the Java heap size to 15.96 GB, but that didn't help.
  • The MATLAB_JAVA environment variable points to jre1.8.0_144.

I get the cryptic error

'MATLAB array exceeds an internal Java limit.' 

on the second line of this code:

import com.mathworks.mlwidgets.io.InterruptibleStreamCopier
a=java.io.ByteArrayInputStream(Z);
b=java.util.zip.GZIPInputStream(a);
isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.ByteArrayOutputStream;
isc.copyStream(b,c);
M=typecast(c.toByteArray,'uint8');

Attempting to implement Mark Adler's suggestion:

Z=reshape(Z,[],8);
import com.mathworks.mlwidgets.io.InterruptibleStreamCopier
a=java.io.ByteArrayInputStream(Z(:,1));
b=java.util.zip.GZIPInputStream(a);
for ct = 2:8
    b.read(Z(:,ct));
end
isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.ByteArrayOutputStream;
isc.copyStream(b,c);

But at the isc.copyStream call I get this error:

Java exception occurred:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(Unknown Source)
    at java.util.zip.InflaterInputStream.read(Unknown Source)
    at java.util.zip.GZIPInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at com.mathworks.mlwidgets.io.InterruptibleStreamCopier.copyStream(InterruptibleStreamCopier.java:72)
    at com.mathworks.mlwidgets.io.InterruptibleStreamCopier.copyStream(InterruptibleStreamCopier.java:51)

Reading directly from file: I also tried to read the data directly from a file.

streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
fileInStream = java.io.FileInputStream(java.io.File(filename));
fileInStream.skip(datastart);
gzipInStream = java.util.zip.GZIPInputStream( fileInStream );
baos = java.io.ByteArrayOutputStream;
streamCopier.copyStream(gzipInStream,baos);
data = baos.toByteArray;
baos.close;
gzipInStream.close;
fileInStream.close;

This works fine for small files, but with big files I get:

Java exception occurred:
java.lang.OutOfMemoryError

at the line streamCopier.copyStream(gzipInStream,baos);

Asked Jan 29 '23 by Gelliant

1 Answer

The bottleneck seems to be the size of each individual Java object being created. This limit is hit in java.io.ByteArrayInputStream(Z), since a MATLAB array cannot be passed into Java without conversion, and again in copyStream, where the data is actually copied into the output buffer. The idea below is to split the input into chunks of an allowable size (src):

function chunkDunzip(Z)
%% Imports:
import com.mathworks.mlwidgets.io.InterruptibleStreamCopier
%% Definitions:
MAX_CHUNK = 100*1024*1024; % 100 MB, just an example
%% Split to chunks:
nChunks = ceil(numel(Z)/MAX_CHUNK);
% nChunks+1 boundaries are needed to delimit nChunks chunks:
chunkBounds = round(linspace(0, numel(Z), max(2, nChunks+1)));

V = java.util.Vector();
for indC = 1:numel(chunkBounds)-1
  V.add(java.io.ByteArrayInputStream(Z(chunkBounds(indC)+1:chunkBounds(indC+1))));
end

S = java.io.SequenceInputStream(V.elements);  
b = java.util.zip.InflaterInputStream(S);

isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.FileOutputStream(java.io.File('D:\outFile.bin'));
isc.copyStream(b,c);
c.close();

end
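For completeness, a minimal usage sketch showing how the decompressed result could be brought back into MATLAB afterwards. The path D:\outFile.bin is simply the one hard-coded in the function above; in practice you would parameterize it:

```matlab
% Hypothetical usage: decompress Z to disk, then read the result back.
chunkDunzip(Z);                       % writes the inflated bytes to D:\outFile.bin
fid = fopen('D:\outFile.bin', 'r');   % path assumed from the function above
M = fread(fid, Inf, '*uint8');        % read the whole file as a uint8 column vector
fclose(fid);
```

Since the data never has to live in a single Java object, only the final MATLAB array needs to fit in memory.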

Several notes:

  1. I used a FileOutputStream since it doesn't run into the internal limit on Java objects (as far as my tests went).
  2. Increasing the Java heap memory is still required.
  3. I demonstrated this using deflate, not gzip. The solution for gzip is very similar; if this is a problem, I'll modify it.
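Regarding note 3, a sketch of the gzip variant, assuming the chunked input in V holds a complete gzip stream (header and trailer included): the only change is swapping InflaterInputStream for GZIPInputStream, which handles the gzip wrapper.

```matlab
% Same chunking as above, but inflating a gzip stream instead of raw deflate:
S = java.io.SequenceInputStream(V.elements);
b = java.util.zip.GZIPInputStream(S);   % parses the gzip header/trailer for us
isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.FileOutputStream(java.io.File('D:\outFile.bin'));
isc.copyStream(b, c);
c.close();
```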
Answered Jan 31 '23 by Dev-iL