
Streaming decompression of zip archives in Python

Is there a way to do streaming decompression of single-file zip archives?

I currently have arbitrarily large zipped archives (one file per archive) in S3. I would like to be able to process the files by iterating over them, without having to download the archives to disk or hold them entirely in memory.

A simple example:

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')

    return count

This answer demonstrates a method for doing the same thing with gzipped files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, since it seems to require random access to the entire file being unzipped.

asked Nov 09 '22 by Rahul Gupta-Iwasaki

1 Answer

Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example, in C, of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, and then moves those files into place and sets their attributes appropriately once it reads the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.
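
As a rough sketch of what that hand-rolled Python could look like: skip the local file header at the start of the archive, then feed the raw deflate data to zlib one chunk at a time using a negative window size. The stream_unzip_single helper below is illustrative only; it assumes a single deflate-compressed entry and does not verify the CRC or read the central directory.

import struct
import zlib
from itertools import chain

def stream_unzip_single(chunks):
    """Yield decompressed bytes from an iterable of byte chunks that
    together form a single-entry zip archive."""
    it = iter(chunks)
    buf = b''

    # Accumulate the fixed 30-byte local file header.
    while len(buf) < 30:
        buf += next(it)
    if buf[:4] != b'PK\x03\x04':
        raise ValueError('expected a local file header')

    method, = struct.unpack('<H', buf[8:10])
    if method != 8:
        raise ValueError('only deflate-compressed entries are handled here')
    name_len, extra_len = struct.unpack('<HH', buf[26:30])
    header_len = 30 + name_len + extra_len

    # Skip the variable-length file name and extra field.
    while len(buf) < header_len:
        buf += next(it)
    buf = buf[header_len:]

    # Zip entries hold a raw deflate stream, hence the negative wbits.
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    for chunk in chain([buf], it):
        yield d.decompress(chunk)
        if d.eof:  # deflate is self-terminating; ignore the trailing central directory
            break

With something like that, the question's count_newlines could sum newline counts as each decompressed piece arrives, e.g. sum(piece.count(b'\n') for piece in stream_unzip_single(key)), without ever holding the whole file in memory.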

answered Nov 14 '22 by Mark Adler