I wrote a Python script that determines the total size of all available AWS S3 buckets by using the Boto3 list_objects() method.
The logic is simple: list every bucket, then for each bucket page through its objects and sum their sizes.
Here's the relevant code snippet:
import boto3

s3_client = boto3.client('s3')

# Get all S3 buckets owned by the authenticated sender of the request
buckets = s3_client.list_buckets()

# For each bucket...
for bucket in buckets['Buckets']:
    # Get up to the first 1,000 objects in the bucket
    bucket_objects = s3_client.list_objects(Bucket=bucket['Name'])

    # Initialize total_size
    total_size = 0

    # Add the size of each individual object in the bucket to the total size
    for obj in bucket_objects['Contents']:
        total_size += obj['Size']

    # Get additional objects from the bucket, if there are more
    while bucket_objects['IsTruncated']:
        # Get the next 1,000 objects, starting after the final object of the current list
        bucket_objects = s3_client.list_objects(
            Bucket=bucket['Name'],
            Marker=bucket_objects['Contents'][-1]['Key'])
        for obj in bucket_objects['Contents']:
            total_size += obj['Size']

    size_in_MB = total_size / 1000000.0
    print('Total size of objects in bucket %s: %.2f MB'
          % (bucket['Name'], size_in_MB))
This code runs relatively quickly on buckets that hold less than 5 MB or so of data; however, when I hit a bucket with 90+ MB of data in it, execution jumps from milliseconds to 20-30+ seconds.
My hope was to use the threading module to parallelize the I/O portion of the code (getting the list of objects from S3), so that each batch of sizes could be added to the total as soon as the thread retrieving it completed, rather than doing the retrieval and addition sequentially.
Just to avoid getting answers to that effect: I understand that Python doesn't support true parallel multithreading because of the GIL, but my understanding is that since this is an I/O operation rather than a CPU-intensive one, the threading module should still be able to improve the run time.
The main difference between my problem and the many threading examples I've seen on here is that I'm not iterating over a known list or set. Here I must first retrieve a list of objects, check whether the list is truncated, and only then retrieve the next list of objects based on the final object's key in the current list.
Can anyone explain a way to improve the run time of this code, or is it not possible in this situation?
To be precise about the GIL: CPython does not support true multi-core parallel execution via threads, but Python does have a threading library, and the GIL does not prevent threads from running concurrently while they are blocked on I/O.
To use multithreading, import the threading module. A thread's start() method initiates its activity and may be called at most once per thread.
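A minimal, generic illustration of starting and joining threads (the worker function here is just a placeholder, not part of the S3 code):

import threading

def worker(name):
    # Placeholder for an I/O-bound task
    print('worker %s running' % name)

# Create the threads, then call start() exactly once on each
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
# Wait for every thread to finish
for t in threads:
    t.join()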
I ran into similar problems.
It seems to be important to create a separate session for each thread.
So instead of
s3_client = boto3.client('s3')
you need to write
s3_client = boto3.session.Session().client('s3')
otherwise threads interfere with each other, and random errors occur.
Beyond that, the usual caveats of multithreading apply.
My project uploads 135,000 files to an S3 bucket. So far I have found that I get the best performance with 8 threads; what would otherwise take 3.6 hours takes 1.25 hours.
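As a rough sketch of the per-thread-session pattern applied to the original question, assuming one task per bucket and an illustrative worker count of 8 (concurrent.futures is my addition here, not part of the answer above):

import boto3
from concurrent.futures import ThreadPoolExecutor

def bucket_size(bucket_name):
    # Each thread builds its own session and client to avoid shared-state issues
    s3 = boto3.session.Session().client('s3')
    total_size = 0
    kwargs = {'Bucket': bucket_name}
    while True:
        response = s3.list_objects(**kwargs)
        for obj in response.get('Contents', []):
            total_size += obj['Size']
        if not response.get('IsTruncated'):
            break
        # Continue pagination from the last key returned
        kwargs['Marker'] = response['Contents'][-1]['Key']
    return bucket_name, total_size

bucket_names = [b['Name'] for b in boto3.client('s3').list_buckets()['Buckets']]
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, size in pool.map(bucket_size, bucket_names):
        print('Total size of objects in bucket %s: %.2f MB' % (name, size / 1000000.0))

Note that this only parallelizes across buckets; within a single large bucket the Marker-based pagination is still sequential, which is what the prefix-based approach below works around.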
I have a solution which may not work in all cases, but it covers a good number of scenarios. If your objects are organised hierarchically in subfolders, then first list only the subfolders, using the mechanism described in this post.
Then submit the obtained set of prefixes to a multiprocessing pool (or thread pool), where each worker fetches all the keys under a single prefix and collects them in a shared container such as a multiprocessing Manager list. This way the keys are fetched in parallel.
The above solution performs best if the keys are distributed evenly and hierarchically, and worst if the data is organized flat.
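A rough sketch of that idea, here using the Delimiter parameter of list_objects_v2 to discover the top-level prefixes, and a thread pool that returns per-prefix totals rather than a shared Manager container; the bucket name, delimiter, and worker count are assumptions you would adapt to your layout:

import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = 'my-bucket'  # hypothetical bucket name

def list_prefixes(bucket):
    # Use Delimiter to list only the top-level "subfolders" (CommonPrefixes)
    s3 = boto3.client('s3')
    prefixes = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Delimiter='/'):
        prefixes += [p['Prefix'] for p in page.get('CommonPrefixes', [])]
    return prefixes

def size_of_prefix(prefix):
    # Each worker lists only the keys under its own prefix, with its own session
    s3 = boto3.session.Session().client('s3')
    total = 0
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get('Contents', []):
            total += obj['Size']
    return total

with ThreadPoolExecutor(max_workers=8) as pool:
    total_size = sum(pool.map(size_of_prefix, list_prefixes(BUCKET)))
print('Total size of objects in bucket %s: %.2f MB' % (BUCKET, total_size / 1000000.0))

Keep in mind that keys sitting directly at the bucket root, outside any prefix, would need a separate pass with this approach.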