Recently, I needed to implement a program in Python that uploads files residing on Amazon EC2 to S3 as quickly as possible. The files are about 30 KB each.
I have tried several approaches: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.
3600 files × 30 KB each ≈ 105 MB total:
**5.5 s [4 processes + 100 coroutines]**
10 s [200 coroutines]
14 s [10 threads]
The code is shown below.
For multithreading:
import os
import threading

def mput(i, client, files):
    # Each thread uploads only the files whose hash maps to its index.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))

def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files))
           for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()
For coroutines (eventlet):
import functools
import os
import sys

import eventlet

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # green-thread pool size from the command line
xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()
For multiprocessing + coroutines:
import functools
import multiprocessing
import os

import eventlet

def pproc(i):
    # Each process opens its own S3 connection and runs its own green pool,
    # handling only the files whose hash maps to its index.
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)
    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()

def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i,))
             for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()
The machine configuration is Ubuntu 14.04 with 2 CPUs (2.50 GHz) and 4 GB of memory.
The highest throughput reached is about 19 MB/s (105 / 5.5). Overall, that is still too slow. Is there any way to speed it up? Would Stackless Python be able to do it faster?
Small files create too much latency for data analytics. Since streaming data arrives as small files, you typically write these files to S3 as-is rather than combining them on write, but many small files impede performance because each object carries its own per-request overhead.
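As a rough illustration of the combine-on-write alternative, here is a minimal sketch that bundles the small files into a single tar archive and uploads it as one object with boto3 (the bucket name and object key are placeholders, not taken from the question):

import io
import os
import tarfile

import boto3  # assumes boto3 is installed and credentials are configured

DATA_DIR = "/path/to/data"   # placeholder for the question's DATA_DIR
BUCKET = "my-bucket"         # placeholder bucket name

def upload_as_single_archive():
    # Pack all small files into one in-memory tar so S3 receives a single
    # PUT instead of thousands of tiny requests.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in os.listdir(DATA_DIR):
            tar.add(os.path.join(DATA_DIR, name), arcname=name)
    buf.seek(0)
    boto3.client("s3").upload_fileobj(buf, BUCKET, "batches/batch-0001.tar")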
The size of an object in S3 can range from a minimum of 0 bytes to a maximum of 5 terabytes, but a single PUT operation can upload at most 5 gigabytes. So if you want to upload an object larger than 5 GB, you need to either use multipart upload or split the file into logical chunks of up to 5 GB and upload them manually as regular uploads.
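With boto3 (the newer SDK; the question uses the older boto), multipart behaviour can be controlled through TransferConfig. The sketch below is illustrative only, with placeholder file, bucket, and threshold values rather than tuned recommendations, and it only matters for large objects, not the 30 KB files in the question:

import boto3
from boto3.s3.transfer import TransferConfig

# upload_file switches to multipart automatically once the object exceeds
# multipart_threshold, and uploads the parts concurrently.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB before multipart kicks in
    multipart_chunksize=16 * 1024 * 1024,  # 16 MB parts
    max_concurrency=10,                    # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file("big-file.bin", "my-bucket", "big-file.bin", Config=config)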
When you upload large files to Amazon S3, it's a best practice to leverage multipart uploads. If you're using the AWS Command Line Interface (AWS CLI), then all high-level aws s3 commands automatically perform a multipart upload when the object is large. These high-level commands include aws s3 cp and aws s3 sync.
Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:
Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
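A minimal sketch of that approach, assuming the AWS CLI is installed and configured, with a placeholder bucket and prefix: shell out to aws s3 sync and let the CLI parallelise the transfers.

import subprocess

DATA_DIR = "/path/to/data"        # placeholder for the question's DATA_DIR
DEST = "s3://my-bucket/uploads/"  # placeholder destination

# aws s3 sync uploads the whole directory, running multiple transfers in
# parallel (concurrency can be raised via the CLI's s3 max_concurrent_requests
# configuration setting).
subprocess.check_call(["aws", "s3", "sync", DATA_DIR, DEST])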