
How can I increase my AWS s3 upload speed when using boto3?

Hey, there were some similar questions, but none exactly like this one, and a fair number of them are several years old and out of date.

I have written some code on my server that uploads jpeg photos into an S3 bucket using a key via the boto3 method upload_file. Initially this seemed great; it is a super simple solution for uploading files into S3.

The thing is, I have users. My users are sending their jpegs to my server via a phone app. While I concede that I could generate presigned upload URLs and send them to the phone app, that would require a considerable rewrite of our phone app and API.
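
(For context, here is roughly what that presigned-URL route would look like on the server side, just to show what I'd rather not rewrite the app around; the bucket name and key below are placeholders, not my real ones.)

import boto3

s3 = boto3.client('s3')

# The server would hand the phone app a short-lived URL and the app would
# PUT the jpeg straight to S3. Bucket and key are placeholders.
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-photo-bucket',
            'Key': 'uploads/user123/photo.jpg',
            'ContentType': 'image/jpeg'},
    ExpiresIn=3600,  # URL valid for one hour
)
# The phone app would then PUT the jpeg bytes to that URL.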

So I just want the phone app to send the photos to the server. I then want to send the photos from the server to s3. I implemented this but it is way too slow. I cannot ask my users to tolerate those slow uploads.

What can I do to speed this up?

I did some Google searching and found this: https://medium.com/@alejandro.millan.frias/optimizing-transfer-throughput-of-small-files-to-amazon-s3-or-anywhere-really-301dca4472a5

It suggests that the solution is to increase the number of TCP/IP connections: more TCP/IP connections mean faster uploads.

Okay, great!

How do I do that? How do I increase the number of TCP/IP connections so I can upload a single jpeg into AWS s3 faster?

Please help.

asked Jun 17 '19 by Peter Jirak Eldritch


1 Answer

Ironically, we've been using boto3 for years, as well as awscli, and we like them both.

But we've often wondered why awscli's aws s3 cp --recursive or aws s3 sync is often so much faster than trying to do a bunch of uploads via boto3, even with concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor. (And don't even dare share the same s3.Bucket among your workers: it is warned against in the docs, and for good reason; nasty crashes will eventually ensue at the most inconvenient time.)
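
For reference, the plain-ThreadPoolExecutor pattern being compared against looks roughly like this; the bucket name and file list are placeholders, and the single client is shared on purpose, since clients (unlike resources such as s3.Bucket) are thread-safe:

import concurrent.futures
import os
import boto3

# Naive parallel upload: each call to upload_file manages its own transfer,
# so this is simple but noticeably heavier than a shared TransferManager.
s3client = boto3.Session().client('s3')

def upload_one(path):
    s3client.upload_file(path, 'my-bucket', os.path.basename(path))

filelist = ['photo1.jpg', 'photo2.jpg']  # placeholder
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(upload_one, filelist))  # force execution, surface errors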

Finally, I bit the bullet and looked inside the "customization" code that awscli introduces on top of boto3.

Based on that little exploration, here is a way to speed up the upload of many files to S3 by using the concurrency already built into boto3.s3.transfer, not just for the possible multiparts of a single large file, but for a whole bunch of files of various sizes as well. That functionality is, as far as I know, not exposed through the higher-level APIs of boto3 described in the boto3 docs.

The following:

  1. Uses boto3.s3.transfer to create a TransferManager, the very same one that is used by awscli's aws s3 sync, for example.

  2. Extends the max number of threads to 20.

  3. Augments the underlying urllib3 max pool connections capacity used by botocore to match (by default, it uses 10 connections maximum).

  4. Gives you an optional callback capability (demoed here with a tqdm progress bar, but of course you can have whatever callback you'd like).

  5. Is fast (over 100 MB/s, tested on an EC2 instance).

I put a complete example as a gist here that includes the generation of 500 random CSV files for a total of about 360 MB. Below, we assume you already have a bunch of files in filelist, for a total of totalsize bytes:

import os
import boto3
import botocore.config
import boto3.s3.transfer as s3transfer

def fast_upload(session, bucketname, s3dir, filelist, progress_func, workers=20):
    # Raise the urllib3 connection pool size so botocore can actually use
    # `workers` parallel connections (the default cap is 10).
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    # The same TransferManager machinery used under the hood by `aws s3 sync`.
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for src in filelist:
        dst = os.path.join(s3dir, os.path.basename(src))
        s3t.upload(
            src, bucketname, dst,
            subscribers=[
                # Reports bytes transferred to progress_func as uploads proceed.
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # wait for all the upload tasks to finish

Example usage

from tqdm import tqdm

bucketname = '<your-bucket-name>'
s3dir = 'some/path/for/junk'
filelist = [...]
totalsize = sum([os.stat(f).st_size for f in filelist])

with tqdm(desc='upload', ncols=60,
          total=totalsize, unit='B', unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, s3dir, filelist, pbar.update)
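
For the original question's scenario (one jpeg per request from many users), the same machinery should still help if you build the client and transfer manager once at server startup and reuse them for every incoming photo, instead of creating them per upload. A rough sketch, where the bucket name and handler shape are assumptions about your server code:

import boto3
import botocore.config
import boto3.s3.transfer as s3transfer

# Created once at server startup and reused for every incoming photo.
workers = 20
s3client = boto3.Session().client(
    's3', config=botocore.config.Config(max_pool_connections=workers))
s3t = s3transfer.create_transfer_manager(
    s3client, s3transfer.TransferConfig(max_concurrency=workers))

def handle_incoming_photo(local_path, key):
    # Hypothetical request handler: queue the upload and return immediately;
    # call future.result() instead if you need to block until it completes.
    future = s3t.upload(local_path, 'my-photo-bucket', key)
    return future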
answered Oct 21 '22 by Pierre D