Recently, I needed to implement a program in Python that uploads files residing on Amazon EC2 to S3 as quickly as possible. The files are about 30 KB each.
I have tried several approaches: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.
3600 files × 30 KB each ≈ 105 MB total:
**5.5 s [4 processes + 100 coroutines]**
10 s [200 coroutines]
14 s [10 threads]
The code is shown below.
For multithreading:
import os
import threading

def mput(i, client, files):
    # Each thread uploads only the files whose hash maps to its index.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))

def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files))
           for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()
For coroutines (eventlet):
import functools
import os
import sys

import eventlet

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # green-thread pool size from the command line
xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()
For multiprocessing + coroutines:
import functools
import multiprocessing
import os

import eventlet

def pproc(i):
    # Each process opens its own S3 connection and runs its own green pool,
    # handling only the files whose hash maps to its index.
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)
    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()

def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i,))
             for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()
The machine configuration is Ubuntu 14.04 with 2 CPUs (2.50 GHz) and 4 GB of memory.
The highest throughput reached is about 19 MB/s (105 / 5.5). Overall, that is still too slow. Is there any way to speed it up? Would Stackless Python be able to do it faster?
Small files create too much latency for data analytics. Since streaming data arrives as small files, you typically write these files to S3 as-is rather than combining them on write, but many small files impede performance because each object carries its own per-request overhead.
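As a rough illustration of the combine-on-write alternative, here is a minimal sketch that bundles the small files into a single tar archive and uploads it as one object with boto3 (the bucket name and object key are placeholders, not taken from the question):

import io
import os
import tarfile

import boto3  # assumes boto3 is installed and credentials are configured

DATA_DIR = "/path/to/data"   # placeholder for the question's DATA_DIR
BUCKET = "my-bucket"         # placeholder bucket name

def upload_as_single_archive():
    # Pack all small files into one in-memory tar so S3 receives a single
    # PUT instead of thousands of tiny requests.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in os.listdir(DATA_DIR):
            tar.add(os.path.join(DATA_DIR, name), arcname=name)
    buf.seek(0)
    boto3.client("s3").upload_fileobj(buf, BUCKET, "batches/batch-0001.tar")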
The size of an object in S3 can range from a minimum of 0 bytes to a maximum of 5 terabytes, but a single PUT operation can upload at most 5 gigabytes. So if you want to upload an object larger than 5 GB, you need to either use multipart upload or split the file into logical chunks of up to 5 GB and upload them manually as regular uploads.
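With boto3 (the newer SDK; the question uses the older boto), multipart behaviour can be controlled through TransferConfig. The sketch below is illustrative only, with placeholder file, bucket, and threshold values rather than tuned recommendations, and it only matters for large objects, not the 30 KB files in the question:

import boto3
from boto3.s3.transfer import TransferConfig

# upload_file switches to multipart automatically once the object exceeds
# multipart_threshold, and uploads the parts concurrently.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB before multipart kicks in
    multipart_chunksize=16 * 1024 * 1024,  # 16 MB parts
    max_concurrency=10,                    # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file("big-file.bin", "my-bucket", "big-file.bin", Config=config)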
When you upload large files to Amazon S3, it's a best practice to leverage multipart uploads. If you're using the AWS Command Line Interface (AWS CLI), then all high-level aws s3 commands automatically perform a multipart upload when the object is large. These high-level commands include aws s3 cp and aws s3 sync.
Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:
Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
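A minimal sketch of that approach, assuming the AWS CLI is installed and configured, with a placeholder bucket and prefix: shell out to aws s3 sync and let the CLI parallelise the transfers.

import subprocess

DATA_DIR = "/path/to/data"        # placeholder for the question's DATA_DIR
DEST = "s3://my-bucket/uploads/"  # placeholder destination

# aws s3 sync uploads the whole directory, running multiple transfers in
# parallel (concurrency can be raised via the CLI's s3 max_concurrent_requests
# configuration setting).
subprocess.check_call(["aws", "s3", "sync", DATA_DIR, DEST])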