Is there a way to do streaming decompression of single-file zip archives?
I currently have arbitrarily large zipped archives (one file per archive) in S3. I would like to process the files by iterating over them, without having to download each archive to disk or read it fully into memory.
A simple example:
import boto
def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)
    count = 0
    for chunk in key:
        # How should decompression happen here?
        count += decompress(chunk).count('\n')
    return count
This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile
module, as it seems to require random access to the entire file being unzipped.
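For context, the gzip technique amounts to feeding each chunk into an incremental zlib decompressor that is told to expect a gzip header (the 16 + MAX_WBITS window-bits value). A minimal sketch of that idea, using the same boto key object as above:

    import zlib

    def count_newlines_gzip(key):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        # instead of a raw zlib stream.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        count = 0
        for chunk in key:  # key is the boto Key, iterated chunk by chunk
            count += d.decompress(chunk).count(b'\n')
        count += d.flush().count(b'\n')
        return count

A zip archive is different: it wraps the deflate data in a per-entry local header and a trailing central directory, which is why the same decompressor settings don't apply directly to .zip bytes.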
Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example, in C, of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, then moves those files and sets their attributes appropriately once it reads the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.
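Since your archives contain exactly one entry, one possible shape for that code (a sketch of the streaming idea, not sunzip itself) is: parse the fixed 30-byte local file header, skip the file name and extra field that follow it, and feed the rest of the stream into zlib with a negative window size so it decodes the raw deflate data. The sketch below assumes a single, unencrypted, deflate-compressed entry, no ZIP64 extensions, and Python 3.3+ for Decompress.eof (on older Pythons you would check unused_data instead); the trailing data descriptor and central directory are simply ignored once the deflate stream ends.

    import struct
    import zlib

    def iter_unzipped(chunks):
        """Stream-decompress the single entry of a zip file given as an
        iterable of byte chunks (e.g. a boto Key)."""
        buf = b''
        header_len = None
        inflater = None
        for chunk in chunks:
            buf += chunk
            if inflater is None:
                # Parse the 30-byte local file header, then skip the
                # file name and extra field that follow it.
                if header_len is None:
                    if len(buf) < 30:
                        continue
                    (sig, _ver, flags, method, _time, _date, _crc, _csize,
                     _usize, name_len, extra_len) = struct.unpack('<4s5H3I2H', buf[:30])
                    if sig != b'PK\x03\x04':
                        raise ValueError('not a zip local file header')
                    if flags & 0x1:
                        raise ValueError('encrypted entries are not supported')
                    if method != 8:
                        raise ValueError('only deflate-compressed entries are handled')
                    header_len = 30 + name_len + extra_len
                if len(buf) < header_len:
                    continue
                # -MAX_WBITS means raw deflate data, no zlib/gzip wrapper.
                inflater = zlib.decompressobj(-zlib.MAX_WBITS)
                buf = buf[header_len:]
            data = inflater.decompress(buf)
            buf = b''
            if data:
                yield data
            if inflater.eof:
                # Deflate stream finished; whatever follows (data descriptor,
                # central directory) is not needed for the file contents.
                return

With that, the newline count from the example would become something like:

    count = sum(piece.count(b'\n') for piece in iter_unzipped(key))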