How can I use threading in Python to parallelize AWS S3 API calls?

I wrote a Python script that determines the total size of all available AWS S3 buckets by making use of the AWS Boto3 list_objects() method.

The logic is simple:

  1. Get the initial list of objects from each S3 bucket (automatically truncated after 1,000 objects)
  2. Iterate through each object in the list of objects, adding the size of that object to a total_size variable
  3. While the bucket still has additional objects, retrieve them and repeat step 2

Here's the relevant code snippet:

import boto3

s3_client = boto3.client('s3')

# Get all S3 buckets owned by the authenticated sender of the request
buckets = s3_client.list_buckets()

# For each bucket...
for bucket in buckets['Buckets']:
    # Get up to first 1,000 objects in bucket
    bucket_objects = s3_client.list_objects(Bucket=bucket['Name'])

    # Initialize total_size
    total_size = 0

    # Add size of each individual item in bucket to total size
    for obj in bucket_objects.get('Contents', []):  # .get() guards against empty buckets
        total_size += obj['Size']

    # Get additional objects from bucket, if more
    while bucket_objects['IsTruncated']:
        # Get next 1,000 objects, starting after final object of current list
        bucket_objects = s3_client.list_objects(
            Bucket=bucket['Name'],
            Marker=bucket_objects['Contents'][-1]['Key'])
        for obj in bucket_objects['Contents']:
            total_size += obj['Size']

    size_in_MB = total_size/1000000.0
    print('Total size of objects in bucket %s: %.2f MB'
        % (bucket['Name'], size_in_MB))

This code runs relatively quickly on buckets that hold less than 5 MB or so of data; however, when I hit a bucket that has 90+ MB of data in it, execution jumps from milliseconds to 20-30+ seconds.

My hope was to use the threading module to parallelize the I/O portion of the code (fetching the lists of objects from S3), so that object sizes could be added to the total as soon as the thread retrieving them completed, rather than doing each retrieval and addition sequentially.

I understand that Python doesn't support true multithreading because of the GIL (I mention this only to head off answers to that effect), but since listing objects is an I/O-bound operation rather than a CPU-intensive one, my understanding is that the threading module should still be able to improve the run time.

The main difference between my problem and the threading examples I've seen here is that I'm not iterating over a known list or set. Here I must first retrieve a list of objects, check whether the list is truncated, and then retrieve the next list of objects based on the final object's key in the current list.

Can anyone suggest a way to improve the run time of this code, or is that not possible in this situation?

asked by Mark, May 09 '16



2 Answers

I ran into similar problems.

It seems to be important to create a separate session for each thread.

So instead of

s3_client = boto3.client('s3')

you need to write

s3_client = boto3.session.Session().client('s3')

Otherwise the threads interfere with each other and random errors occur.

Beyond that, the normal issues of multithreading apply.

My project uploads 135,000 files to an S3 bucket. So far I have found that I get the best performance with 8 threads: what would otherwise take 3.6 hours takes 1.25 hours.
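
For illustration, here is a minimal sketch of the session-per-thread pattern applied to the asker's bucket-sizing task (the function and variable names are my own, and I've swapped in a list_objects_v2 paginator to replace the manual IsTruncated/Marker loop). It parallelizes across buckets, since the page-by-page listing within a single bucket is inherently sequential:

import boto3
from concurrent.futures import ThreadPoolExecutor

def bucket_size(bucket_name):
    # Each worker builds its own session-backed client, so no client
    # object is ever shared between threads
    s3 = boto3.session.Session().client('s3')
    total_size = 0
    # The paginator handles the IsTruncated/Marker loop internally
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            total_size += obj['Size']
    return bucket_name, total_size

bucket_names = [b['Name'] for b in boto3.client('s3').list_buckets()['Buckets']]

# 8 workers, matching the sweet spot described above
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, size in pool.map(bucket_size, bucket_names):
        print('Total size of objects in bucket %s: %.2f MB'
            % (name, size / 1000000.0))

Since pool.map yields results in submission order, the totals print in the same order as the bucket list.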

answered by Peter Dobson


I have a solution which may not work in all cases but covers a good number of scenarios. If your objects are organised hierarchically under prefixes (subfolders), first list only those subfolders using the mechanism described in this post.

Then submit the obtained set of prefixes to a multiprocessing pool (or thread pool), where each worker fetches all keys under one prefix and collects them in a shared container using a multiprocessing Manager. This way, keys are fetched in parallel.

The above solution performs best if keys are distributed evenly across the hierarchy, and worst if the data is organised flat.
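
For illustration, a rough sketch of this approach, assuming a single level of '/'-delimited subfolders (the bucket name is hypothetical, and I collect results from thread return values instead of a multiprocessing Manager):

import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = 'my-example-bucket'  # hypothetical bucket name

def keys_under(prefix):
    # Separate session-backed client per thread, per the answer above
    s3 = boto3.session.Session().client('s3')
    keys = []
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys

# Delimiter='/' makes S3 return the top-level "subfolders" as
# CommonPrefixes instead of listing every key
response = boto3.client('s3').list_objects_v2(Bucket=BUCKET, Delimiter='/')
prefixes = [p['Prefix'] for p in response.get('CommonPrefixes', [])]

all_keys = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for keys in pool.map(keys_under, prefixes):
        all_keys.extend(keys)

Note that the initial Delimiter listing is itself capped at 1,000 entries, so a bucket with more top-level prefixes than that would need the same pagination treatment.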

answered by Pranav Gupta