Reading a huge ZIP file in Java - Out of Memory Error

Tags: java, zip

I am reading a ZIP file in Java as below:

    Enumeration<? extends ZipEntry> zes = zip.entries();
    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        // do stuff..
    }

I am getting an out of memory error; the ZIP file is about 160 MB. The stack trace is as below:

    Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
        at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
        at java.util.zip.ZipFile$1.<init>(ZipFile.java:229)
        at java.util.zip.ZipFile.getInputStream(ZipFile.java:229)
        at java.util.zip.ZipFile.getInputStream(ZipFile.java:197)
        at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.zipFilePass2(DatToInsertDBBatch.java:250)
        at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.processCompany(DatToInsertDBBatch.java:206)
        at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.run(DatToInsertDBBatch.java:114)
        at java.util.TimerThread.mainLoop(Timer.java:534)
        at java.util.TimerThread.run(Timer.java:484)

How do I enumerate the contents of a big ZIP file without having to increase my heap size? Also, when I don't enumerate the contents and just access a single file like this:

    ZipFile zip = new ZipFile(zipFile);
    ZipEntry ze = zip.getEntry("docxml.xml");

then I don't get an out of memory error. Why does this happen? How does ZipFile handle its entries? The other option would be to use a ZipInputStream. Would that have a smaller memory footprint? I will eventually need to run this code on a micro EC2 instance on the Amazon cloud (613 MB of RAM).
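
For the ZipInputStream option, this is roughly what I have in mind (just a sketch; the per-entry handling is a placeholder). It reads the archive sequentially, so only one entry's stream is open at a time:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    // Sketch only: read the archive sequentially instead of via ZipFile.entries().
    try (ZipInputStream zis = new ZipInputStream(
            new BufferedInputStream(new FileInputStream(zipFile)))) {
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            // zis is positioned at this entry's data; read/copy it here
            // without buffering the whole entry in memory.
            zis.closeEntry();
        }
    }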

EDIT: Providing more information on how I process the ZIP entries after I get them:

    Enumeration<? extends ZipEntry> zes = zip.entries();
    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
        s3Object.setDataInputStream(zip.getInputStream(ze));
        s3Object.setStorageClass(S3Object.STORAGE_CLASS_REDUCED_REDUNDANCY);
        s3Object.addMetadata("x-amz-server-side-encryption", "AES256");
        s3Object.setContentType(Mimetypes.getInstance().getMimetype(s3Object.getKey()));
        s3Object.setContentDisposition("attachment; filename=" + FilenameUtils.getName(s3Object.getKey()));
        s3objs.add(s3Object);
    }

I get the input stream from the ZipEntry and store it in the S3Object. I collect all the S3Objects in a list and then finally upload them to Amazon S3. For those who don't know Amazon S3, it's a file storage service; you upload files via HTTP.

I am thinking this might be happening because I collect all the individual input streams. Would it help if I batched it up, say 100 input streams at a time? Or would it be better to unzip the archive first and upload the unzipped files rather than storing streams?

asked Oct 10 '22 by sethu

1 Answer

It is very unlikely that you get an out-of-memory exception just because you are processing a ZIP file. The Java classes ZipFile and ZipEntry don't contain anything that could possibly fill up 613 MB of memory.

What could exhaust your memory is keeping the decompressed files of the ZIP archive in memory, or, even worse, keeping them as an XML DOM, which is very memory intensive.

Switching to another ZIP library will hardly help. Instead, you should look into changing your code so that it processes the ZIP archive and the contained files as streams and only keeps a limited part of each file in memory at a time.
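
To illustrate what processing the entries as streams could look like, here is a minimal sketch (processEntry is a hypothetical helper standing in for whatever you actually do with the data). Each entry's stream is consumed and closed before the next one is opened, so only one decompressed stream exists at a time:

    import java.io.InputStream;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Sketch: handle each entry while its stream is open, instead of collecting
    // all the streams (and their inflater buffers) in a list first.
    ZipFile zip = new ZipFile(zipFile);
    Enumeration<? extends ZipEntry> zes = zip.entries();
    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        try (InputStream in = zip.getInputStream(ze)) {
            processEntry(ze.getName(), in); // hypothetical: consume the stream here (e.g. copy or upload), don't store it
        }
    }
    zip.close();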

BTW: It would be nice if you could provide more information about the huge ZIP files (do they contain many small files or a few large ones?) and about what you do with each ZIP entry.

Update:

Thanks for the additional information. It looks like you keep the contents of the ZIP file in memory (although it somewhat depends on the implementation of the S3Object class, which I don't know).

It's probably best to implement some sort of batching, as you propose yourself. You could, for example, add up the decompressed size of each ZIP entry and upload the files every time the total size exceeds 100 MB.
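
A rough sketch of that batching idea, assuming a hypothetical uploadBatch helper that builds the S3Objects for the collected entries, uploads them, and then lets the streams go (the 100 MB threshold is purely illustrative):

    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Sketch: flush a batch whenever the accumulated uncompressed size of the
    // collected entries crosses a threshold.
    long batchBytes = 0;
    List<ZipEntry> batch = new ArrayList<ZipEntry>();
    Enumeration<? extends ZipEntry> zes = zip.entries();
    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        batch.add(ze);
        batchBytes += Math.max(ze.getSize(), 0); // getSize() returns -1 when unknown
        if (batchBytes > 100L * 1024 * 1024) {
            uploadBatch(zip, batch); // hypothetical: open the streams, upload, then close them
            batch.clear();
            batchBytes = 0;
        }
    }
    if (!batch.isEmpty()) {
        uploadBatch(zip, batch); // upload whatever is left
    }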

answered Oct 13 '22 by Codo