Python Multi-Processing Question?

Tags: python, zeromq

I have a folder with 500 input files (the total size of all files is ~500 MB).

I'd like to write a python script that does the following:

(1) load all of the input files into memory

(2) initialize an empty python list that will later be used ... see step (4)

(3) start 15 different (independent) processes: each of these uses the same input data [from (1)] -- yet each runs a different algorithm on it, thus generating different results

(4) have all the independent processes [from step (3)] store their output in the same python list [the same list that was initialized in step (2)]

Once all 15 processes have completed their runs, I will have one python list that contains the results of all 15 independent processes.

My question is: is it possible to do the above efficiently in python? If so, can you provide a scheme / sample code that illustrates how to do so? (A rough sketch of the structure I have in mind follows the notes below.)

Note #1: I will be running this on a powerful multi-core server; the goal is to use all the processing power while sharing some memory (the input data and the output list) among all the independent processes.

Note #2: I am working in a Linux environment.
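
To make the structure concrete, here is a rough, untested sketch of what I'm imagining, using multiprocessing.Pool (algorithm_1 is a placeholder for my real algorithms, and 'input_folder' is a placeholder path; I'm also not sure whether the input data actually gets shared or copied into each worker process):

# rough sketch of steps (1)-(4); algorithm_1 stands in for my real
# algorithms, and 'input_folder' is a placeholder path
import os
from multiprocessing import Pool

def load_inputs(folder):
    # (1) load every input file into memory
    data = {}
    for name in os.listdir(folder):
        with open(os.path.join(folder, name)) as f:
            data[name] = f.read()
    return data

def algorithm_1(data):
    # placeholder: each real algorithm processes the full input data
    return "result of algorithm 1"

# ... algorithm_2 through algorithm_15 would be defined similarly ...

if __name__ == '__main__':
    data = load_inputs('input_folder')   # (1)
    results = []                         # (2)
    pool = Pool(processes=15)            # (3) one process per algorithm
    for algorithm in [algorithm_1]:      # would list all 15 algorithms
        # (4) the callback runs in the parent process and appends each
        # algorithm's output to the shared results list
        pool.apply_async(algorithm, (data,), callback=results.append)
    pool.close()
    pool.join()
    # results now holds one entry per algorithm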

asked Jun 30 '11 by user3262424


1 Answer

OK, I just whipped this up using zeromq to demonstrate a single subscriber to multiple publishers. You could probably do the same with queues (there's a rough sketch of that after the zmq demo below), but you would need to manage them a bit more. zeromq sockets just work, which makes them nice for things like this, IMO.

"""
demo of multiple processes doing processing and publishing the results
to a common subscriber
"""
from multiprocessing import Process


class Worker(Process):
    def __init__(self, filename, bind):
        self._filename = filename
        self._bind = bind
        super(Worker, self).__init__()

    def run(self):
        import zmq
        import time
        ctx = zmq.Context()
        result_publisher = ctx.socket(zmq.PUB)
        result_publisher.bind(self._bind)
        time.sleep(1)
        with open(self._filename) as my_input:
            for l in my_input.readlines():
                result_publisher.send(l)

if __name__ == '__main__':
    import sys
    import os
    import zmq

    #assume every argument but the first is a file to be processed
    files = sys.argv[1:]

    # create a worker for each file to be processed if it exists pass
    # in a bind argument instructing the socket to communicate via ipc
    workers = [Worker(f, "ipc://%s_%s" % (f, i)) for i, f \
               in enumerate((x for x in files if os.path.exists(x)))]

    # create subscriber socket
    ctx = zmq.Context()

    result_subscriber = ctx.socket(zmq.SUB)
    result_subscriber.setsockopt(zmq.SUBSCRIBE, "")

    # wire up subscriber to whatever the worker is bound to 
    for w in workers:
        print w._bind
        result_subscriber.connect(w._bind)

    # start workers
    for w in workers:
        print "starting workers..."
        w.start()

    result = []

    # read from the subscriber and add it to the result list as long
    # as at least one worker is alive
    while [w for w in workers if w.is_alive()]:
        result.append(result_subscriber.recv())
    else:
        # output the result
        print result
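
For comparison, here is a rough, untested sketch of the queue-based alternative mentioned above: every worker writes to a single shared multiprocessing.Queue and puts a None sentinel when it finishes, so the parent knows when to stop collecting.

"""
same demo with a multiprocessing.Queue instead of zmq: every worker
writes to one shared queue and signals completion with a None sentinel
"""
from multiprocessing import Process, Queue


def worker(filename, results):
    with open(filename) as my_input:
        for line in my_input:
            results.put(line)
    results.put(None)  # sentinel: this worker has finished

if __name__ == '__main__':
    import sys
    import os

    files = [f for f in sys.argv[1:] if os.path.exists(f)]
    results = Queue()
    workers = [Process(target=worker, args=(f, results)) for f in files]

    for w in workers:
        w.start()

    result = []
    finished = 0
    # collect until we have seen one sentinel per worker
    while finished < len(workers):
        item = results.get()
        if item is None:
            finished += 1
        else:
            result.append(item)

    for w in workers:
        w.join()

    print(result)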

Oh, and to get zmq, just:

$ pip install pyzmq
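
Then run the demo with the files to be processed as arguments, e.g. (assuming you saved the zmq script as demo.py):

$ python demo.py input_folder/*.txt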
answered Sep 21 '22 by Tom Willis