 

Reading zip file efficiently in Java

I am working on a project that handles a very large amount of data. I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines). What I am currently doing is the following:

for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = (ZipEntry) zf.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...

In this way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines that need to be read, I need to read them in a more efficient way.

I have looked for a different approach, but I haven't been able to find anything. What I think I should use are the Java NIO APIs, which are intended precisely for intensive I/O operations, but I don't know how to use them with zip files.
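For illustration, here is a minimal sketch of the NIO route via the zip file system provider (jdk.zipfs); the path archive.zip and the entry name data.txt are placeholders, not names from the original post:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NioZipSketch {
    public static void main(String[] args) throws IOException {
        Path zip = Paths.get("archive.zip"); // placeholder path
        // Mount the zip with the NIO zip file system provider (Java 7+),
        // then read the single entry with Files.newBufferedReader (Java 8+).
        try (FileSystem fs = FileSystems.newFileSystem(zip, (ClassLoader) null);
             BufferedReader in = Files.newBufferedReader(fs.getPath("data.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // process the line
            }
        }
    }
}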

Any help would really be appreciated.

Thanks,

Marco

asked May 24 '12 by smellyarmpits

2 Answers

I have a lot (thousands) of zip files. The zipped files are about 30 MB each, while the txt inside the zip file is about 60-70 MB. Reading and processing the files with this code takes a lot of hours, around 15, but it depends.

Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, that equates to ~10 seconds per file. The files are about 30 MB each, so the throughput is ~3 MB/s.

This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.

Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.

The best way to find out for sure is by using a profiler.
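Short of a full profiler run, a rough way to split the time is to measure the two phases separately. A minimal sketch, reusing the loop and imports from the question; processLine() is a hypothetical stand-in for whatever the real per-line work is:

long readNanos = 0, processNanos = 0;
for (File zipFile : dir.listFiles()) {
    try (ZipFile zf = new ZipFile(zipFile);
         BufferedReader in = new BufferedReader(
                 new InputStreamReader(zf.getInputStream(zf.entries().nextElement())))) {
        String line;
        while (true) {
            long t0 = System.nanoTime();
            line = in.readLine();                  // I/O + decompression
            readNanos += System.nanoTime() - t0;
            if (line == null) break;

            long t1 = System.nanoTime();
            processLine(line);                     // stand-in for the real work
            processNanos += System.nanoTime() - t1;
        }
    }
}
System.out.printf("read+decompress: %d s, process: %d s%n",
        readNanos / 1_000_000_000L, processNanos / 1_000_000_000L);

If most of the time lands in the processing bucket, faster zip reading won't help much.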

answered Oct 11 '22 by NPE


The right way to iterate over a zip file

final ZipFile file = new ZipFile( FILE_NAME );
try
{
    final Enumeration<? extends ZipEntry> entries = file.entries();
    while ( entries.hasMoreElements() )
    {
        final ZipEntry entry = entries.nextElement();
        System.out.println( entry.getName() );
        //use entry input stream:
        readInputStream( file.getInputStream( entry ) );
    }
}
finally
{
    file.close();
}

private static int readInputStream( final InputStream is ) throws IOException {
    final byte[] buf = new byte[ 8192 ];
    int read = 0;
    int cntRead;
    while ( ( cntRead = is.read( buf, 0, buf.length ) ) >= 0 )
    {
        read += cntRead;
    }
    return read;
}

A zip file consists of several entries, each of which has a field containing the number of bytes in that entry. So it is easy to iterate over all zip file entries without actually decompressing the data. java.util.zip.ZipFile accepts a file/file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, works with streams, so it cannot jump around freely. That's why it has to read and decompress all zip data in order to reach EOF for each entry and read the next entry header.

What does this mean? If you already have a zip file in your file system, use ZipFile to process it, regardless of your task. As a bonus, you can access zip entries either sequentially or randomly (with a rather small performance penalty). On the other hand, if you are processing a stream, you'll need to process all entries sequentially using ZipInputStream, as in the sketch below.
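For comparison, a minimal sketch of the stream-based route with ZipInputStream (archive.zip is a placeholder; imports from java.util.zip and java.io assumed). Every entry's data has to be read, and therefore decompressed, in order:

try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
    final byte[] buf = new byte[8192];
    ZipEntry entry;
    while ((entry = zis.getNextEntry()) != null) {
        System.out.println(entry.getName());
        // Reading the entry body is what forces the decompression.
        while (zis.read(buf, 0, buf.length) >= 0) {
            // consume / process the bytes
        }
        zis.closeEntry();
    }
}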

Here is an example. A zip archive (total file size = 1.6 GB) containing three 0.6 GB entries was iterated in 0.05 s using ZipFile and in 18 s using ZipInputStream.

answered Oct 11 '22 by Wasim Wani