Throughput differences when using coroutines vs threading

A few days ago I asked a question on SO about designing a structure for making multiple HTTP requests.

Here's the scenario. I would like to have a multi-producer, multi-consumer system. My producers crawl and scrape a few sites and add the links that they find to a queue. Since I'll be crawling multiple sites, I would like to have multiple producers/crawlers.

The consumers/workers feed off this queue, make TCP/UDP requests to these links and save the results to my Django DB. I would also like to have multiple workers, as each queue item is totally independent of the others.

People suggested that I use a coroutine library for this, i.e. Gevent or Eventlet. Having never worked with coroutines, I read that even though the programming paradigm is similar to the threaded paradigm, only one thread is actively executing at a time. When a blocking call occurs, such as an I/O call, the stacks are switched in memory and another green thread takes over until it in turn encounters a blocking I/O call. Hopefully I got this right? Here's the code from one of my SO posts:

import gevent
from gevent.queue import JoinableQueue
import time
import random

q = JoinableQueue()
workers = []
producers = []


def do_work(wid, value):
    gevent.sleep(random.randint(0,2))
    print 'Task', value, 'done', wid


def worker(wid):
    while True:
        item = q.get()
        try:
            print "Got item %s" % item
            do_work(wid, item)
        finally:
            print "No more items"
            q.task_done()


def producer():
    while True:
        item = random.randint(1, 11)
        if item == 10:
            print "Signal Received"
            return
        else:
            print "Added item %s" % item
            q.put(item)


for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))

# This doesn't work.
for j in range(2):
    producers.append(gevent.spawn(producer))

# Uncommenting this makes this script work.
# producer()

q.join()

This works well because the sleep calls are blocking calls, and when a sleep occurs another green thread takes over. This is a lot faster than sequential execution. As you can see, I don't have any code in my program that purposely yields execution from one thread to another. I fail to see how this fits into the scenario above, as I would like to have all the threads executing simultaneously.

It all works fine, but I feel that the throughput I've achieved using Gevent/Eventlet is higher than that of the original sequential program but drastically lower than what could be achieved using real threading.

If I were to re-implement my program using threading mechanisms, each of my producers and consumers could be working simultaneously without the need to swap stacks in and out as coroutines do.

Should this be re-implemented using threading? Is my design wrong? I've failed to see the real benefits of using coroutines.

Maybe my concepts are a little muddy, but this is what I've assimilated. Any help or clarification of my paradigm and concepts would be great.

Thanks

asked Feb 12 '12 09:02

Mridang Agarwalla

1 Answer

"As you can see, I don't have any code in my program that purposely yields the execution of one thread to another thread. I fail to see how this fits into scenario above as I would like to have all the threads executing simultaneously."

There is a single OS thread but several greenlets. In your case gevent.sleep() allows workers to execute concurrently. Blocking IO calls such as urllib2.urlopen(url).read() do the same if you use urllib2 patched to work with gevent (by calling gevent.monkey.patch_*()).

See also A Curious Course on Coroutines and Concurrency to understand how code can run concurrently in a single-threaded environment.
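As a rough illustration of that idea, here is a minimal sketch of cooperative scheduling built on plain generators. It is not how gevent is implemented (gevent switches greenlets around blocking calls); the names task, run_round_robin and demo are made up for this example, and each bare yield only stands in for a point where a greenlet would hand control back. It is written for Python 3.

```python
from collections import deque


def task(name, steps, log):
    """A cooperative task: record one unit of work, then give up control."""
    for step in range(steps):
        log.append((name, step))
        yield  # stands in for a blocking call where a greenlet would switch


def run_round_robin(tasks):
    """Run generator-based tasks in a single thread until all have finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)  # resume the task until its next yield
        except StopIteration:
            continue  # the task finished, so it is not rescheduled
        ready.append(current)  # the task yielded, so it goes to the back


def demo():
    """Interleave two tasks and return the order in which they ran."""
    log = []
    run_round_robin([task("a", 2, log), task("b", 3, log)])
    return log


if __name__ == "__main__":
    print(demo())
```

The two tasks take turns even though only one of them runs at any moment, which is the same kind of interleaving the greenlets get from gevent.sleep() in the question.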

To compare throughput differences between gevent, threading, and multiprocessing, you could write code that is compatible with all approaches:

#!/usr/bin/env python
concurrency_impl = 'gevent' # single process, single thread
##concurrency_impl = 'threading' # single process, multiple threads
##concurrency_impl = 'multiprocessing' # multiple processes

if concurrency_impl == 'gevent':
    import gevent.monkey; gevent.monkey.patch_all()

import logging
import time
import random
from itertools import count, islice

info = logging.info

if concurrency_impl in ['gevent', 'threading']:
    from Queue import Queue as JoinableQueue
    from threading import Thread
if concurrency_impl == 'multiprocessing':
    from multiprocessing import Process as Thread, JoinableQueue

The rest of the script is the same for all concurrency implementations:

def do_work(wid, value):
    time.sleep(random.randint(0,2))
    info("%d Task %s done" % (wid, value))

def worker(wid, q):
    while True:
        item = q.get()
        try:
            info("%d Got item %s" % (wid, item))
            do_work(wid, item)
        finally:
            q.task_done()
            info("%d Done item %s" % (wid, item))

def producer(pid, q):
    for item in iter(lambda: random.randint(1, 11), 10):
        time.sleep(.1) # simulate a green blocking call that yields control
        info("%d Added item %s" % (pid, item))
        q.put(item)
    info("%d Signal Received" % (pid,))

Don't execute code at module level; put it in main():

def main():
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(process)d %(message)s")

    q = JoinableQueue()
    it = count(1)
    producers = [Thread(target=producer, args=(i, q)) for i in islice(it, 2)]
    workers = [Thread(target=worker, args=(i, q)) for i in islice(it, 4)]
    for t in producers+workers:
        t.daemon = True
        t.start()

    for t in producers: t.join() # put items in the queue
    q.join() # wait while it is empty
    # exit main thread (daemon workers die at this point)

if __name__ == "__main__":
    main()
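To put numbers on the comparison, one option is to time the same workload under each implementation. The following is only a sketch under stated assumptions: it uses Python 3 (where the module is named queue rather than Queue), covers just the sequential and threading cases, and replaces real requests with a fixed time.sleep(). The helper names fake_request, run_sequential, run_threaded and compare are hypothetical and do not appear in the script above.

```python
import queue
import threading
import time


def fake_request(delay):
    """Stand-in for a blocking network call (hypothetical workload)."""
    time.sleep(delay)


def run_sequential(n_items, delay):
    """Process all items one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n_items):
        fake_request(delay)
    return time.perf_counter() - start


def run_threaded(n_items, delay, n_workers):
    """Process all items with a pool of worker threads; return elapsed seconds."""
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            try:
                fake_request(item)
            finally:
                q.task_done()

    for _ in range(n_workers):
        # daemon threads do not keep the process alive once the work is done
        threading.Thread(target=worker, daemon=True).start()

    start = time.perf_counter()
    for _ in range(n_items):
        q.put(delay)
    q.join()  # block until every queued item has been marked done
    return time.perf_counter() - start


def compare(n_items=8, delay=0.05, n_workers=4):
    """Return (sequential_seconds, threaded_seconds) for the same workload."""
    return run_sequential(n_items, delay), run_threaded(n_items, delay, n_workers)


if __name__ == "__main__":
    seq, thr = compare()
    print("sequential: %.3fs, threaded: %.3fs" % (seq, thr))
```

A sleep-only workload mostly measures how well waiting overlaps, so it says little about CPU-bound work; for a realistic comparison, swap fake_request for the actual request code and run the same timing with the gevent and multiprocessing variants.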

answered Oct 12 '22 22:10

jfs

Tags: python, coroutine, multithreading, gevent
