I have zip files uploaded to S3. I'd like to download them for processing. I don't need to permanently store them, but I need to temporarily process them. How would I go about doing this?
Because working software > comprehensive documentation:
import io
import zipfile

import boto

# Connect to S3.
# This will need your S3 credentials to be set up,
# e.g. with `aws configure` using the AWS CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# Get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given key
from boto.s3.key import Key
key = Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:
    # Read the object into it
    key.get_file(b)
    # Reset the file pointer to the beginning
    b.seek(0)
    # Read the buffer as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()
Or, the same thing using boto3:

import io
import zipfile

import boto3

# Passing credentials explicitly is just for the demo. Real use
# should rely on the environment variables or the config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html
session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY",
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # Rewind the file
    tf.seek(0)
    # Read the buffer as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
Tested on macOS with Python 3.
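`namelist()` only lists the member names; to actually work on each member's contents without touching disk, you can read the bytes straight out of the archive with `zipf.read()`. A minimal sketch — the archive here is built in memory purely for illustration, standing in for the `io.BytesIO(obj.get()["Body"].read())` you get back from S3:

```python
import io
import zipfile

# Build a small zip in memory to stand in for the S3 download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")
buf.seek(0)

# Read each member's bytes directly from the in-memory archive,
# no files on disk involved.
with zipfile.ZipFile(buf, mode='r') as zipf:
    for name in zipf.namelist():
        data = zipf.read(name)  # the raw bytes of that member
        print(name, len(data))
```

For very large members, `zipf.open(name)` gives you a file-like object you can stream from instead of loading the whole member at once.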
If speed is a concern, a good approach is to choose an EC2 instance in the same region as your S3 bucket and use that instance to unzip/process your zipped files.
Keeping the compute next to the data cuts latency and lets you process the files efficiently. You can remove each extracted file after finishing your work.
Note: This will only work if you are fine using EC2 instances.
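Whether on EC2 or anywhere scratch disk is acceptable, Python's `tempfile` module makes the "process, then throw away" pattern from the question easy: everything is removed when the context manager exits. A minimal sketch — the S3 download is left as a comment since the bucket/key names are placeholders:

```python
import os
import tempfile
import zipfile

def process_zip(zip_path, handle_file):
    """Extract zip_path into a throwaway directory, call
    handle_file() on each extracted file, then clean up."""
    with tempfile.TemporaryDirectory() as tmpdir:
        with zipfile.ZipFile(zip_path) as zipf:
            zipf.extractall(tmpdir)
        for root, _dirs, files in os.walk(tmpdir):
            for name in files:
                handle_file(os.path.join(root, name))
    # tmpdir and everything extracted into it is gone here

# With S3 it would look something like (names are placeholders):
#   s3 = boto3.client("s3")
#   with tempfile.NamedTemporaryFile(suffix=".zip") as tmp:
#       s3.download_fileobj("my_bucket_name", "my_s3_object_key", tmp)
#       tmp.flush()
#       process_zip(tmp.name, handle_file)
```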
Pandas provides a shortcut for this, which removes most of the code from the top answer and lets you be agnostic about whether your file path is on S3, GCP, or your local machine. Note that get_filepath_or_buffer is a pandas-internal helper, so it may change or disappear between pandas versions.

import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]

with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream to, for example, print the member names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)