unzip huge gz file in Java and performance

I am unzipping a huge gz file in Java; the gz file is about 2 GB and the unzipped file is about 6 GB. From time to time the unzipping process takes forever (hours); sometimes it finishes in a reasonable time (under 10 minutes or quicker).
I have a fairly powerful box (8 GB RAM, 4 CPUs). Is there a way to improve the code below, or should I use a completely different library?
I also passed -Xms256m and -Xmx4g to the VM.

public static File unzipGZ(File file, File outputDir) {
    GZIPInputStream in = null;
    OutputStream out = null;
    File target = null;
    try {
        // Open the compressed file
        in = new GZIPInputStream(new FileInputStream(file));

        // Open the output file
        target = new File(outputDir, FileUtil.stripFileExt(file.getName()));
        out = new FileOutputStream(target);

        // Transfer bytes from the compressed file to the output file
        byte[] buf = new byte[1024];
        int len;
        while ((len = in.read(buf)) > 0) {
            out.write(buf, 0, len);
        }

        // Close the file and stream
        in.close();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
        if (out != null) {
            try {
                out.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }
    return target;
}
asked Dec 08 '25 by user121196
1 Answer

I don't know how much buffering is applied by default, if any - but you might want to try wrapping both the input and output in a BufferedInputStream / BufferedOutputStream. You could also try increasing your buffer size - 1K is a pretty small buffer. Experiment with different sizes, e.g. 16K, 64K etc. These should make the use of BufferedInputStream rather less important, of course.
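As a concrete illustration of that advice, here is a minimal sketch of the copy loop with both streams wrapped in buffers and a larger (64K) transfer buffer. The class and method names (`GunzipBuffered`, `gunzip`) are mine, not from the question:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;

public class GunzipBuffered {
    // Sketch: buffer both the compressed input and the decompressed output,
    // and copy through a 64K array instead of the original 1K one.
    public static long gunzip(File gzFile, File target) throws IOException {
        long total = 0;
        try (InputStream in = new GZIPInputStream(
                 new BufferedInputStream(new FileInputStream(gzFile), 64 * 1024));
             OutputStream out = new BufferedOutputStream(
                 new FileOutputStream(target), 64 * 1024)) {
            byte[] buf = new byte[64 * 1024];
            int len;
            while ((len = in.read(buf)) > 0) {
                out.write(buf, 0, len);
                total += len;
            }
        }
        return total; // decompressed byte count
    }
}
```

The buffer sizes here are just a starting point for the experiments suggested above; measure with your own data before settling on one.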

On the other hand, I suspect this isn't really the problem. If it sometimes finishes in 10 minutes and sometimes takes hours, that suggests something very odd is going on. When it takes a very long time, is it actually making progress? Is the output file increasing in size? Is it using significant CPU? Is the disk constantly in use?
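One cheap way to answer the "is it making progress?" question is to have the copy loop itself report how far it has got. A sketch (the `ProgressCopy` class and the `reportEvery` parameter are illustrative, not from the question):

```java
import java.io.*;

public class ProgressCopy {
    // Sketch: a copy loop that logs the running byte count every
    // `reportEvery` bytes, so a stalled unzip is distinguishable
    // from a merely slow one.
    public static long copy(InputStream in, OutputStream out, long reportEvery)
            throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        long nextReport = reportEvery;
        int len;
        while ((len = in.read(buf)) > 0) {
            out.write(buf, 0, len);
            total += len;
            if (total >= nextReport) {
                System.err.printf("copied %d bytes%n", total);
                nextReport += reportEvery;
            }
        }
        return total;
    }
}
```

If the logged count stops advancing while CPU or disk stays busy, that points at the environment (disk, antivirus, swapping) rather than the code.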

One side note: as you're closing in and out in finally blocks, you don't need to do it in the try block as well.
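On Java 7 and later, try-with-resources removes that closing boilerplate entirely: both streams are closed automatically, in reverse order, even when an exception is thrown. A minimal sketch of the same method in that style (class name mine):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;

public class CloseSketch {
    // Sketch: try-with-resources closes `in` and `out` automatically,
    // replacing both the in-try close calls and the finally blocks.
    public static void copyGz(File gz, File target) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(gz));
             OutputStream out = new FileOutputStream(target)) {
            byte[] buf = new byte[8192];
            int len;
            while ((len = in.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
        }
    }
}
```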

answered Dec 11 '25 by Jon Skeet

