
How can I increase my AWS s3 upload speed when using boto3?

Hey, there were some similar questions, but none exactly like this one, and a fair number of them are several years old and out of date.

I have written some code on my server that uploads jpeg photos into an S3 bucket using a key via the boto3 method upload_file. Initially this seemed great; it is a super simple solution for uploading files into S3.

The thing is, I have users. My users are sending their jpegs to my server via a phone app. While I concede that I could generate presigned upload URLs and send them to the phone app, that would require a considerable rewrite of our phone app and API.
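
(For context, here is roughly what that presigned-URL route would look like on the server side, just to show what I'd rather not rewrite the app around; the bucket name and key below are placeholders, not my real ones.)

import boto3

s3 = boto3.client('s3')

# The server would hand the phone app a short-lived URL and the app would
# PUT the jpeg straight to S3. Bucket and key are placeholders.
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-photo-bucket',
            'Key': 'uploads/user123/photo.jpg',
            'ContentType': 'image/jpeg'},
    ExpiresIn=3600,  # URL valid for one hour
)
# The phone app would then PUT the jpeg bytes to that URL.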

So I just want the phone app to send the photos to the server. I then want to send the photos from the server to s3. I implemented this but it is way too slow. I cannot ask my users to tolerate those slow uploads.

What can I do to speed this up?

I did some Google searching and found this: https://medium.com/@alejandro.millan.frias/optimizing-transfer-throughput-of-small-files-to-amazon-s3-or-anywhere-really-301dca4472a5

It suggests that the solution is to increase the number of TCP/IP connections: more TCP/IP connections mean faster uploads.

Okay, great!

How do I do that? How do I increase the number of TCP/IP connections so I can upload a single jpeg into AWS s3 faster?

Please help.

asked Jun 17 '19 by Peter Jirak Eldritch


1 Answer

Ironically, we've been using boto3 for years, as well as awscli, and we like them both.

But we've often wondered why awscli's aws s3 cp --recursive or aws s3 sync is often so much faster than trying to do a bunch of uploads via boto3, even with concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor. (And don't even dare share the same s3.Bucket among your workers: it is warned against in the docs, and for good reason; nasty crashes will eventually ensue at the most inconvenient time.)
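
For reference, the plain-ThreadPoolExecutor pattern being compared against looks roughly like this; the bucket name and file list are placeholders, and the single client is shared on purpose, since clients (unlike resources such as s3.Bucket) are thread-safe:

import concurrent.futures
import os
import boto3

# Naive parallel upload: each call to upload_file manages its own transfer,
# so this is simple but noticeably heavier than a shared TransferManager.
s3client = boto3.Session().client('s3')

def upload_one(path):
    s3client.upload_file(path, 'my-bucket', os.path.basename(path))

filelist = ['photo1.jpg', 'photo2.jpg']  # placeholder
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(upload_one, filelist))  # force execution, surface errors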

Finally, I bit the bullet and looked inside the "customization" code that awscli introduces on top of boto3.

Based on that little exploration, here is a way to speed up the upload of many files to S3 by using the concurrency already built into boto3.s3.transfer, not just for the possible multiparts of a single large file, but for a whole bunch of files of various sizes as well. That functionality is, as far as I know, not exposed through the higher-level APIs of boto3 described in the boto3 docs.

The following:

  1. Uses boto3.s3.transfer to create a TransferManager, the very same one that is used by awscli's aws s3 sync, for example.

  2. Extends the max number of threads to 20.

  3. Augments the underlying urllib3 max pool connections capacity used by botocore to match (by default, it uses 10 connections maximum).

  4. Gives you an optional callback capability (demoed here with a tqdm progress bar, but of course you can have whatever callback you'd like).

  5. Is fast (over 100 MB/s, tested on an EC2 instance).

I put a complete example as a gist here that includes the generation of 500 random CSV files for a total of about 360 MB. Below, we assume you already have a bunch of files in filelist, for a total of totalsize bytes:

import os
import boto3
import botocore.config
import boto3.s3.transfer as s3transfer

def fast_upload(session, bucketname, s3dir, filelist, progress_func, workers=20):
    # Raise the urllib3 connection pool size so botocore can actually use
    # `workers` parallel connections (the default cap is 10).
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    # The same TransferManager machinery used under the hood by `aws s3 sync`.
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for src in filelist:
        dst = os.path.join(s3dir, os.path.basename(src))
        s3t.upload(
            src, bucketname, dst,
            subscribers=[
                # Reports bytes transferred to progress_func as uploads proceed.
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # wait for all the upload tasks to finish

Example usage

from tqdm import tqdm

bucketname = '<your-bucket-name>'
s3dir = 'some/path/for/junk'
filelist = [...]
totalsize = sum([os.stat(f).st_size for f in filelist])

with tqdm(desc='upload', ncols=60,
          total=totalsize, unit='B', unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, s3dir, filelist, pbar.update)
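
For the original question's scenario (one jpeg per request from many users), the same machinery should still help if you build the client and transfer manager once at server startup and reuse them for every incoming photo, instead of creating them per upload. A rough sketch, where the bucket name and handler shape are assumptions about your server code:

import boto3
import botocore.config
import boto3.s3.transfer as s3transfer

# Created once at server startup and reused for every incoming photo.
workers = 20
s3client = boto3.Session().client(
    's3', config=botocore.config.Config(max_pool_connections=workers))
s3t = s3transfer.create_transfer_manager(
    s3client, s3transfer.TransferConfig(max_concurrency=workers))

def handle_incoming_photo(local_path, key):
    # Hypothetical request handler: queue the upload and return immediately;
    # call future.result() instead if you need to block until it completes.
    future = s3t.upload(local_path, 'my-photo-bucket', key)
    return future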
answered Oct 21 '22 by Pierre D