 

Python S3 download zip file

I have zip files uploaded to S3. I'd like to download them for processing. I don't need to store them permanently; I only need them temporarily so I can process them. How would I go about doing this?

asked Apr 29 '14 by user1802143


3 Answers

Because working software > comprehensive documentation:

Boto2

import io
import zipfile

import boto
from boto.s3.key import Key

# Connect to s3
# This will need your s3 credentials to be set up 
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:

    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()
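
The do_stuff_with_subfile() call above is just a placeholder. As a minimal sketch of one way to fill it in, you can read each member's bytes straight out of the in-memory buffer b from the example above, using only the standard-library zipfile API:

# Sketch: read each member's contents without extracting to disk.
# `b` is the BytesIO buffer from the example above.
with zipfile.ZipFile(b, mode='r') as zipf:
    for subfile in zipf.namelist():
        with zipf.open(subfile) as member:
            data = member.read()          # raw bytes of this member
            print(subfile, len(data))     # replace with your real processing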

Boto3

import zipfile
import boto3
import io

# This is just for the demo. Real use should rely on the standard
# credential environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html

session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY", 
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:

    # rewind the file
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Tested on macOS with Python 3.
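
If the archive is too large to hold comfortably in memory, one variation is to spool the object to a temporary file first and read the zip from there. This is only a sketch, reusing the same bucket and object key as above, and relying on boto3's download_fileobj:

import tempfile
import zipfile

import boto3

s3 = boto3.session.Session().resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')

# Stream the object to a temporary file on disk instead of holding
# the whole archive in memory.
with tempfile.TemporaryFile() as tmp:
    bucket.download_fileobj('smsspamcollection.zip', tmp)
    tmp.seek(0)

    with zipfile.ZipFile(tmp, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)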

answered Oct 19 '22 by brice


If speed is a concern, a good approach is to spin up an EC2 instance in the same region as your S3 bucket and use that instance to unzip and process your zipped files.

This reduces latency and lets you process the files fairly efficiently. You can delete each extracted file once you've finished working with it.

Note: This will only work if you are fine using EC2 instances.
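
On such an instance the workflow itself is the same as in the accepted answer, only spooled through local disk. A rough sketch, where the bucket name, key, and process() function are placeholders and credentials are assumed to come from the instance's IAM role:

import os
import tempfile
import zipfile

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # placeholder bucket name

with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "archive.zip")

    # Download the archive onto the instance's local disk.
    bucket.download_file("my_s3_object_key.zip", archive_path)

    # Extract, process, and let TemporaryDirectory clean everything up.
    with zipfile.ZipFile(archive_path) as zipf:
        zipf.extractall(workdir)
        for name in zipf.namelist():
            process(os.path.join(workdir, name))  # process() is a placeholder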

answered Oct 19 '22 by DanGar


Pandas provides a shortcut for this that removes most of the code from the top answer and lets you stay agnostic about whether your file path is on S3, GCS, or your local machine. Note that get_filepath_or_buffer is an internal pandas helper, so it may not be available in newer pandas releases.

import io
import zipfile

import pandas as pd

# file_path can be a local path or a URL such as "s3://my-bucket/my-key.zip"
obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream to, for example, print the member names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
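
If get_filepath_or_buffer is not available in your pandas version, one alternative that keeps the same path-agnostic behaviour is fsspec, which pandas itself uses for remote paths. A sketch, assuming s3fs is installed for s3:// URLs and file_path is the same placeholder as above:

import io
import zipfile

import fsspec

# Works with local paths as well as "s3://..." or "gs://..." URLs,
# provided the matching filesystem package (s3fs, gcsfs) is installed.
with fsspec.open(file_path, mode="rb") as f:
    with zipfile.ZipFile(io.BytesIO(f.read()), mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)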
answered Oct 19 '22 by Teddy Ward