 

Downloading a Large Number of Files from S3

What's the fastest way to fetch a large number of relatively small files (10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)

At the moment I am using boto to generate signed URLs and PycURL to fetch the files one by one.

Would some type of concurrency help? A pycurl.CurlMulti object?

I am open to all suggestions. Thanks!
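For context, here is a minimal sketch of the setup described above, using boto to list a bucket and generate signed URLs; the bucket name and expiry are placeholders, not values from the question:

```python
import boto

# Connect using credentials from the environment / boto config.
conn = boto.connect_s3()
bucket = conn.get_bucket("my-bucket")  # hypothetical bucket name

# One signed GET URL per key, valid for an hour.
signed_urls = [key.generate_url(expires_in=3600) for key in bucket.list()]
```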

asked by The Unknown

2 Answers

I don't know anything about Python, but in general you would want to break the task down into smaller chunks so that they can run concurrently. You could break it down by file type, alphabetically, or something similar, and then run a separate script for each portion of the breakdown.
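As a rough Python illustration of that idea (not part of the original answer): split a list of signed URLs into chunks and hand each chunk to a separate worker process via the multiprocessing module. The input file name, chunk count, and process count are placeholders to tune.

```python
from multiprocessing import Pool
import urllib.request

def download(url):
    # Derive a local filename from the URL path and fetch the object.
    filename = url.split("/")[-1].split("?")[0]
    urllib.request.urlretrieve(url, filename)

def download_chunk(urls):
    for url in urls:
        download(url)

def chunked(seq, n):
    # Split seq into chunks of roughly len(seq) / n items each.
    size = max(1, len(seq) // n)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == "__main__":
    with open("signed_urls.txt") as f:          # hypothetical input file
        urls = [line.strip() for line in f if line.strip()]
    with Pool(processes=8) as pool:             # pool of 8 worker processes
        pool.map(download_chunk, chunked(urls, 8))
```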

answered by gburgoon


In the case of Python, since this is I/O bound, multiple threads will make use of the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the (relatively new) multiprocessing module. Even then, you may want each process to use multiple threads. You would have to do some tweaking of the number of processes and threads.

If you do use multiple threads, this is a good candidate for the Queue class.
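A minimal sketch of that threads-plus-Queue pattern (assuming Python 3, pycurl, and a signed_urls list like the one the question describes; the thread count and filename handling are placeholders):

```python
import os
import queue
import threading
from urllib.parse import urlparse

import pycurl

NUM_THREADS = 20  # tune against your bandwidth and S3 latency

def worker(url_queue):
    curl = pycurl.Curl()              # reuse one handle per thread
    while True:
        url = url_queue.get()
        if url is None:               # sentinel: no more work
            url_queue.task_done()
            break
        filename = os.path.basename(urlparse(url).path)
        with open(filename, "wb") as f:
            curl.setopt(pycurl.URL, url)
            curl.setopt(pycurl.WRITEDATA, f)
            curl.perform()
        url_queue.task_done()
    curl.close()

def download_all(signed_urls):
    url_queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(url_queue,))
               for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for url in signed_urls:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)           # one sentinel per thread
    url_queue.join()
    for t in threads:
        t.join()
```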

answered by Kathy Van Stone


