
Streaming decompression of zip archives in Python

Is there a way to do streaming decompression of single-file zip archives?

I currently have arbitrarily large zipped archives (one file per archive) in S3. I would like to be able to process the files by iterating over them, without having to download the archives to disk or hold them entirely in memory.

A simple example:

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')

    return count

This answer demonstrates a method for doing the same thing with gzipped files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, since it seems to require random access to the entire file being unzipped.

asked Nov 09 '22 by Rahul Gupta-Iwasaki

1 Answer

Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example, in C, of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, and then moves those files into place and sets their attributes appropriately once it reads the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.
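
As a rough sketch of what that hand-rolled Python could look like: skip the local file header at the start of the archive, then feed the raw deflate data to zlib one chunk at a time using a negative window size. The stream_unzip_single helper below is illustrative only; it assumes a single deflate-compressed entry and does not verify the CRC or read the central directory.

import struct
import zlib
from itertools import chain

def stream_unzip_single(chunks):
    """Yield decompressed bytes from an iterable of byte chunks that
    together form a single-entry zip archive."""
    it = iter(chunks)
    buf = b''

    # Accumulate the fixed 30-byte local file header.
    while len(buf) < 30:
        buf += next(it)
    if buf[:4] != b'PK\x03\x04':
        raise ValueError('expected a local file header')

    method, = struct.unpack('<H', buf[8:10])
    if method != 8:
        raise ValueError('only deflate-compressed entries are handled here')
    name_len, extra_len = struct.unpack('<HH', buf[26:30])
    header_len = 30 + name_len + extra_len

    # Skip the variable-length file name and extra field.
    while len(buf) < header_len:
        buf += next(it)
    buf = buf[header_len:]

    # Zip entries hold a raw deflate stream, hence the negative wbits.
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    for chunk in chain([buf], it):
        yield d.decompress(chunk)
        if d.eof:  # deflate is self-terminating; ignore the trailing central directory
            break

With something like that, the question's count_newlines could sum newline counts as each decompressed piece arrives, e.g. sum(piece.count(b'\n') for piece in stream_unzip_single(key)), without ever holding the whole file in memory.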

answered Nov 14 '22 by Mark Adler