
Fastest way to download 3 million objects from a S3 bucket

I've tried using Python + boto + multiprocessing, S3cmd and J3tset, but I'm struggling with all of them.

Any suggestions, perhaps a ready-made script you've been using or another way I don't know of?

EDIT:

eventlet + boto is a worthwhile solution, as mentioned in the answer below. I also found a good eventlet reference article here: http://web.archive.org/web/20110520140439/http://teddziuba.com/2010/02/eventlet-asynchronous-io-for-g.html

I've added the python script that I'm using right now below.

asked Jan 18 '11 by Jagtesh Chadha


1 Answer

Okay, I figured out a solution based on @Matt Billenstein's hint. It uses the eventlet library. The first step is the most important one here (monkey patching the standard IO libraries).

Run this script in the background with nohup and you're all set.

from eventlet import *
patcher.monkey_patch(all=True)  # must run first: green-ify the standard IO libraries

import os, sys, time
from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

import logging

logging.basicConfig(filename="s3_download.log", level=logging.INFO)


def download_file(key_name):
    # It's important to download the key over a new connection --
    # boto connections can't be shared safely across greenthreads
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    try:
        key.get_contents_to_filename(key.name)
    except Exception:
        logging.info(key.name + ":" + "FAILED")


if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket_list:  # was bucket.list(), which ignored the PREFIX filter
        pool.spawn_n(download_file, key.key)
    pool.waitall()
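For example, a hypothetical invocation (assuming the script above is saved as s3_download.py; the filename is just an illustration):

    nohup python s3_download.py &

Failures are written to s3_download.log (per the logging.basicConfig call above), so you can keep an eye on progress with tail -f s3_download.log.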
answered Sep 22 '22 by Jagtesh Chadha