How can I use threading in Python to parallelize AWS S3 API calls?

I wrote a Python script that determines the total size of all available AWS S3 buckets by making use of the AWS Boto3 list_objects() method.

The logic is simple:

  1. Get the initial list of objects from each S3 bucket (automatically truncated after 1,000 objects)
  2. Iterate through each object in the list of objects, adding the size of that object to a total_size variable
  3. While the bucket still has additional objects, retrieve them and repeat step 2

Here's the relevant code snippet:

import boto3

s3_client = boto3.client('s3')

# Get all S3 buckets owned by the authenticated sender of the request
buckets = s3_client.list_buckets()

# For each bucket...
for bucket in buckets['Buckets']:
    # Get up to first 1,000 objects in bucket
    bucket_objects = s3_client.list_objects(Bucket=bucket['Name'])

    # Initialize total_size
    total_size = 0

    # Add size of each individual item in bucket to total size
    for obj in bucket_objects.get('Contents', []):  # .get() guards against empty buckets
        total_size += obj['Size']

    # Get additional objects from bucket, if more
    while bucket_objects['IsTruncated']:
        # Get next 1,000 objects, starting after final object of current list
        bucket_objects = s3_client.list_objects(
            Bucket=bucket['Name'],
            Marker=bucket_objects['Contents'][-1]['Key'])
        for obj in bucket_objects['Contents']:
            total_size += obj['Size']

    size_in_MB = total_size/1000000.0
    print('Total size of objects in bucket %s: %.2f MB'
        % (bucket['Name'], size_in_MB))

This code runs relatively quickly on buckets that hold less than 5 MB or so of data; however, when I hit a bucket that has 90+ MB of data in it, execution jumps from milliseconds to 20-30+ seconds.

My hope was to use the threading module to parallelize the I/O portion of the code (fetching the lists of objects from S3), so that object sizes could be added to the total as soon as the thread retrieving them completed, rather than doing each retrieval and addition sequentially.

I understand that Python doesn't support true multithreading because of the GIL (I mention this only to head off answers to that effect), but since listing objects is an I/O-bound operation rather than a CPU-intensive one, my understanding is that the threading module should still be able to improve the run time.

The main difference between my problem and the threading examples I've seen here is that I'm not iterating over a known list or set. Here I must first retrieve a list of objects, check whether the list is truncated, and then retrieve the next list of objects based on the final object's key in the current list.

Can anyone suggest a way to improve the run time of this code, or is that not possible in this situation?

asked by Mark, May 09 '16



2 Answers

I ran into similar problems.

It seems to be important to create a separate session for each thread.

So instead of

s3_client = boto3.client('s3')

you need to write

s3_client = boto3.session.Session().client('s3')

Otherwise the threads interfere with each other and random errors occur.

Beyond that, the normal issues of multithreading apply.

My project uploads 135,000 files to an S3 bucket. So far I have found that I get the best performance with 8 threads: what would otherwise take 3.6 hours takes 1.25 hours.
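
For illustration, here is a minimal sketch of the session-per-thread pattern applied to the asker's bucket-sizing task (the function and variable names are my own, and I've swapped in a list_objects_v2 paginator to replace the manual IsTruncated/Marker loop). It parallelizes across buckets, since the page-by-page listing within a single bucket is inherently sequential:

import boto3
from concurrent.futures import ThreadPoolExecutor

def bucket_size(bucket_name):
    # Each worker builds its own session-backed client, so no client
    # object is ever shared between threads
    s3 = boto3.session.Session().client('s3')
    total_size = 0
    # The paginator handles the IsTruncated/Marker loop internally
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            total_size += obj['Size']
    return bucket_name, total_size

bucket_names = [b['Name'] for b in boto3.client('s3').list_buckets()['Buckets']]

# 8 workers, matching the sweet spot described above
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, size in pool.map(bucket_size, bucket_names):
        print('Total size of objects in bucket %s: %.2f MB'
            % (name, size / 1000000.0))

Since pool.map yields results in submission order, the totals print in the same order as the bucket list.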

answered by Peter Dobson


I have a solution which may not work in all cases but covers a good number of scenarios. If your objects are organised hierarchically under prefixes (subfolders), first list only those subfolders using the mechanism described in this post.

Then submit the obtained set of prefixes to a multiprocessing pool (or thread pool), where each worker fetches all keys under one prefix and collects them in a shared container using a multiprocessing Manager. This way, keys are fetched in parallel.

The above solution performs best if keys are distributed evenly across the hierarchy, and worst if the data is organised flat.
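
For illustration, a rough sketch of this approach, assuming a single level of '/'-delimited subfolders (the bucket name is hypothetical, and I collect results from thread return values instead of a multiprocessing Manager):

import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = 'my-example-bucket'  # hypothetical bucket name

def keys_under(prefix):
    # Separate session-backed client per thread, per the answer above
    s3 = boto3.session.Session().client('s3')
    keys = []
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys

# Delimiter='/' makes S3 return the top-level "subfolders" as
# CommonPrefixes instead of listing every key
response = boto3.client('s3').list_objects_v2(Bucket=BUCKET, Delimiter='/')
prefixes = [p['Prefix'] for p in response.get('CommonPrefixes', [])]

all_keys = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for keys in pool.map(keys_under, prefixes):
        all_keys.extend(keys)

Note that the initial Delimiter listing is itself capped at 1,000 entries, so a bucket with more top-level prefixes than that would need the same pagination treatment.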

answered by Pranav Gupta