Suggestions on distributing python data/code over worker nodes?

I'm starting to venture into distributed code and am having trouble figuring out which of the many available solutions fits my needs. I have a Python list of data that I need to process with a single function. The function has a few nested for loops but doesn't take too long (about a minute) for each item in the list. My problem is that the list is very large (3000+ items). I'm looking at multiprocessing, but I think I want to experiment with processing it across multiple servers, because ideally, if the data gets larger, I want to be able to add more servers during the job to make it run quicker.

I'm basically looking for something I can distribute this data list through (not essential, but it would be nice if I could distribute my code base through it as well).

So my question is: what package can I use to achieve this? My database is HBase, so I already have Hadoop running (I've never used Hadoop itself, though; I only use it for the database). I looked at Celery and Twisted as well, but I'm confused about which will fit my needs.

Any suggestions?

Lostsoul asked Feb 16 '12 20:02



2 Answers

I would highly recommend Celery. You can define a task that operates on a single item of your list:

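# Older Celery releases exposed this as `from celery.task import task`;
# that module was removed in Celery 5, where shared_task is the equivalent.
from celery import shared_task

@shared_task
def process(i):
    # do something with i
    i += 1
    # return a result
    return i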

You can easily parallelize a list like this:

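results = []
todo = [1, 2, 3, 4, 5]
for arg in todo:
    # args must be a tuple; note the trailing comma in (arg,)
    res = process.apply_async(args=(arg,))
    results.append(res)

# get() blocks until each task finishes (requires a configured result backend)
all_results = [res.get() for res in results]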

This scales easily: just add more Celery workers.
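For comparison, the same submit-then-collect pattern can be tried on a single machine with the standard library before any broker is set up. This is only a sketch, not Celery: `run_all` and `local_baseline` are hypothetical helper names, and `process` is a stand-in for the real per-item function.

```python
from concurrent.futures import Executor, ProcessPoolExecutor


def process(i):
    # Stand-in for the real per-item function (hypothetical workload).
    return i + 1


def run_all(items, executor: Executor):
    # Submit every item first, then collect, mirroring apply_async() + get().
    futures = [executor.submit(process, item) for item in items]
    # Iterating the futures in submission order keeps results in input order.
    return [future.result() for future in futures]


def local_baseline(items):
    # Uses local CPU cores only; it cannot span servers the way Celery can.
    with ProcessPoolExecutor() as pool:
        return run_all(items, pool)
```

Calling `local_baseline([1, 2, 3, 4, 5])` from a script should be done under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. Unlike Celery workers, this pool cannot grow across machines while the job is running.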

jterrace answered Oct 18 '22 07:10



Check out RabbitMQ. Python bindings are available through pika. Start with a simple work queue and run a few RPC calls.

It may look troublesome to experiment with distributed computing in Python using an external engine like RabbitMQ (there's a small learning curve for installing and configuring it), but you may find it even more useful later.
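To get a feel for the work-queue pattern before installing anything, it can be mimicked in a single process with the standard library. The sketch below does not use RabbitMQ or pika; it swaps in `queue.Queue` and threads purely to show the shape: a producer enqueues items and several workers pull from the queue until told to stop. The names `run_work_queue`, `worker`, and `STOP` are made up for illustration, and `process` is a stand-in for the real function.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def process(item):
    # Stand-in for the real per-item function (hypothetical workload).
    return item + 1


def worker(tasks, results):
    # Pull items until the sentinel arrives, like a consumer on a work queue.
    while True:
        item = tasks.get()
        if item is STOP:
            break
        results.put((item, process(item)))


def run_work_queue(items, num_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:   # producer side: enqueue every item
        tasks.put(item)
    for _ in threads:    # one sentinel per worker
        tasks.put(STOP)
    for t in threads:
        t.join()
    # Completion order is not guaranteed, so key each result by its input.
    out = {}
    while not results.empty():
        item, value = results.get()
        out[item] = value
    return out
```

With a real broker the queue lives outside the Python process, so workers can run on other servers and be added while the job is in progress; threads here only illustrate the flow and will not speed up CPU-bound loops.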

... and Celery can work hand-in-hand with RabbitMQ; check out Robert Pogorzelski's tutorial and "Simple distributed tasks with Celery and RabbitMQ".

user237419 answered Oct 18 '22 07:10
