 

Is there a faster way to download multiple files from S3 to a local folder?

I am trying to download 12,000 files from an S3 bucket using a Jupyter notebook, and the estimated time to complete the download is 21 hours. This is because each file is downloaded one at a time. Can I run multiple downloads in parallel to speed up the process?

Currently, I am using the following code to download all the files:

### Get unique full-resolution image basenames
images = df['full_resolution_image_basename'].unique()
print(f'No. of unique full-resolution images: {len(images)}')

### Create a folder for full-resolution images
images_dir = './images/'
os.makedirs(images_dir, exist_ok=True)

### Download images
images_str = "','".join(images)
limiting_clause = f"CONTAINS(ARRAY['{images_str}'], 
full_resolution_image_basename)"
_ = download_full_resolution_images(images_dir, 
limiting_clause=limiting_clause)
asked Mar 10 '18 by Jothi

People also ask

How do I download multiple files from S3 bucket to local?

Unfortunately, the AWS S3 console doesn't currently have an option to download the entire contents of an S3 bucket, but other tools such as the AWS CLI can do it. For instance: aws s3 sync s3://all_my_stuff_bucket . This command downloads all the objects in all_my_stuff_bucket to the current directory.
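
If you would rather do the same thing from Python instead of the CLI, a rough equivalent is to page through the bucket listing and download each object. This is only a sketch: the bucket name and destination below are placeholders, not values from the question.

import os
import boto3

BUCKET = 'all_my_stuff_bucket'  # placeholder, matching the CLI example above
DEST = '.'                      # download into the current directory

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):   # skip "folder" placeholder objects
            continue
        target = os.path.join(DEST, key)
        # Recreate the key's directory structure locally before writing the file
        os.makedirs(os.path.dirname(target) or '.', exist_ok=True)
        s3.download_file(BUCKET, key, target)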

Can I download multiple files from S3?

Yes. Because the Amazon S3 console doesn't support downloading multiple objects at once, you need to use the AWS CLI tool installed on your local machine (or in AWS CloudShell) to download the contents of the bucket.

Why is downloading from S3 so slow?

Large object size: for very large Amazon S3 objects, you might notice slow download times as your web browser tries to download the entire object in a single request. Instead, try downloading large objects with ranged GET requests using the Amazon S3 API.
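
For reference, here is a minimal sketch of what ranged GETs look like with boto3; the bucket name, object key, output file and chunk size below are placeholders, not values from the question.

import boto3

BUCKET = 'my-bucket'       # placeholder bucket name
KEY = 'big-object.bin'     # placeholder object key
CHUNK = 8 * 1024 * 1024    # 8 MiB per ranged request

s3 = boto3.client('s3')
size = s3.head_object(Bucket=BUCKET, Key=KEY)['ContentLength']

with open('big-object.bin', 'wb') as out:
    for start in range(0, size, CHUNK):
        end = min(start + CHUNK, size) - 1
        # Range asks S3 to return only bytes start..end of the object
        part = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f'bytes={start}-{end}')
        out.write(part['Body'].read())

Note that boto3's download_file/download_fileobj already perform ranged, multi-part downloads for large objects through the transfer manager, so an explicit loop like this is mainly useful when you want direct control over the ranges.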

How fast is S3 copy?

S3-bucket-copying performance can exceed 8 gigabytes per second.


1 Answer

See the code below. This will only work with Python 3.6+ because of the f-strings (PEP 498); use a different method of string formatting for older versions of Python.

Provide relative_path, bucket_name and s3_object_keys. max_workers is optional; if not provided, ThreadPoolExecutor defaults to 5 times the number of processors on the machine.

Most of the code for this answer came from an answer to How to create an async generator in Python?, which in turn is based on an example in the concurrent.futures documentation.

import boto3
import os
from concurrent import futures


relative_path = './images'
bucket_name = 'bucket_name'
s3_object_keys = []  # List of S3 object keys
max_workers = 5

abs_path = os.path.abspath(relative_path)
s3 = boto3.client('s3')

def fetch(key):
    file = f'{abs_path}/{key}'
    # Create the parent directories for the target file, not the file path itself
    os.makedirs(os.path.dirname(file), exist_ok=True)
    with open(file, 'wb') as data:
        s3.download_fileobj(bucket_name, key, data)
    return file


def fetch_all(keys):

    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one download per key and remember which key each future belongs to
        future_to_key = {executor.submit(fetch, key): key for key in keys}

        print("All downloads submitted.")

        for future in futures.as_completed(future_to_key):

            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception


for key, result in fetch_all(s3_object_keys):
    print(f'key: {key}  result: {result}')
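
To adapt this to the question, build s3_object_keys from the unique basenames in the first code block and raise max_workers. The key prefix, file extension and bucket name below are hypothetical, since the question doesn't show how basenames map to object keys.

# Hypothetical mapping from basenames to S3 keys; adjust the prefix and
# extension to match the actual bucket layout.
bucket_name = 'my-image-bucket'                                      # hypothetical
s3_object_keys = [f'full_resolution/{name}.tif' for name in images]
max_workers = 20  # with 12,000 small files, more threads usually help

for key, result in fetch_all(s3_object_keys):
    print(f'key: {key}  result: {result}')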
answered Nov 16 '22 by Diego Goding