I'd like to know how multiprocessing is done right. Assuming I have a list [1, 2, 3, 4, 5] generated by function f1, which is written to a Queue (left green circle). Now I start two processes pulling from that queue (by executing f2 in the processes). They process the data, say: doubling the value, and write it to the second queue. Now, function f3 reads this data and prints it out.

Inside the functions there is a kind of loop, trying to read from the queue forever. How do I stop this process?
Idea 1

f1 does not only send the list, but also a None object or a custom object (class PipelineTerminator: pass, or some such) which is just propagated all the way down. f3 then waits for the None to arrive and, when it's there, breaks out of the loop. Problem: it's possible that one of the two f2s reads and propagates the None while the other one is still processing a number. Then the last value is lost.
Idea 2

f3 is f1. So the function f1 generates the data and the pipes, spawns the processes with f2, and feeds in all the data. After spawning and feeding, it listens on the second pipe, simply counting and processing the received objects. Because it knows how much data it fed, it can terminate the processes executing f2. But if the goal is to set up a processing pipeline, the different steps should be separable. So f1, f2 and f3 are different elements of a pipeline, and the expensive steps are done in parallel.
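A minimal runnable sketch of this idea, assuming None as the termination marker; the function names follow the description above, while num_workers and the data argument are my own additions:

```python
import multiprocessing as mp

def f2(inq, outq):
    # Worker: double values until the None sentinel arrives.
    while True:
        val = inq.get()
        if val is None:
            break
        outq.put(val * 2)

def f1(data, num_workers=2):
    # Driver (Idea 2): f1 feeds the queue, spawns the f2 workers, and
    # collects the results itself. Because it knows how many items it
    # fed, it knows how many results to wait for before shutting down.
    inq, outq = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=f2, args=(inq, outq))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for item in data:
        inq.put(item)
    results = [outq.get() for _ in data]   # count the replies
    for _ in workers:                      # now terminate the workers
        inq.put(None)
    for w in workers:
        w.join()
    return results

if __name__ == '__main__':
    print(sorted(f1([1, 2, 3, 4, 5])))     # → [2, 4, 6, 8, 10]
```

Results are sorted because the two workers may finish items in any order.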
Idea 3

Each piece of the pipeline is a function; this function spawns processes as it likes and is responsible for managing them. It knows how much data came in and how much data has been returned (with yield, maybe), so it's safe to propagate a None object:
    setup child processes

    execute thread one and two and wait until both finished

    thread 1:
        while True:
            pull from input queue
            if None: break and set finished_flag
            else: push to queue1 and increment counter1

    thread 2:
        while True:
            pull from queue2
            increment counter2
            yield result
            if counter1 == counter2 and finished_flag: break

    when both threads finished: kill process pool and return.

(Instead of using threads, maybe one can think of a smarter solution.)
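A runnable sketch of such a self-managing stage. It is simplified relative to the pseudocode: the whole input is fed before results are yielded, instead of overlapping the two with threads. The names stage, _worker and double are my own:

```python
import multiprocessing as mp

def _worker(func, inq, outq):
    # Apply func to each item until the None sentinel arrives.
    while True:
        val = inq.get()
        if val is None:
            break
        outq.put(func(val))

def stage(source, func, num_workers=2):
    # One self-managing pipeline element: it spawns its own workers,
    # counts how much data came in, and therefore knows exactly how
    # many results to yield before shutting its workers down.
    inq, outq = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=_worker, args=(func, inq, outq))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    count = 0
    for item in source:        # feed everything first (simplification:
        inq.put(item)          # the two-thread version would overlap
        count += 1             # feeding and yielding)
    for _ in workers:
        inq.put(None)          # one sentinel per worker, none is lost
    for _ in range(count):
        yield outq.get()
    for w in workers:
        w.join()

def double(v):
    return v * 2

if __name__ == '__main__':
    # Stages plug together like generators; order may vary per worker.
    print(sorted(stage(stage([1, 2, 3, 4, 5], double), double)))
```

Because each stage manages its own sentinels and counts, chaining stages never loses the last value, which was the problem with Idea 1.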
So ...

I have implemented a solution following Idea 2, feeding and waiting for the results to arrive, but it was not really a pipeline with independent functions plugged together. It worked for the task I had to manage, but was hard to maintain.

I'd like to hear from you now how you implement pipelines (easy in one process with generator functions and so on, but how with multiple processes?) and how you usually manage them.
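For contrast, here is the single-process generator version that the question alludes to; each stage consumes the previous one lazily, so no queues or sentinels are needed:

```python
def f1():
    # Source stage: emit the input values.
    yield from [1, 2, 3, 4, 5]

def f2(values):
    # Processing stage: double each value.
    for v in values:
        yield v * 2

def f3(values):
    # Sink stage: print each result.
    for v in values:
        print(v)

f3(f2(f1()))   # prints 2, 4, 6, 8, 10
```

The difficulty the question raises is exactly that this plug-together style does not carry over directly to multiple processes, where termination must be signaled explicitly.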
With the MPipe module, simply do this:

    from mpipe import OrderedStage, Pipeline

    def f1(value):
        return value * 2

    def f2(value):
        print(value)

    s1 = OrderedStage(f1, size=2)
    s2 = OrderedStage(f2)
    p = Pipeline(s1.link(s2))

    for task in 1, 2, 3, 4, 5, None:
        p.put(task)
The above runs 4 processes: two for the first stage (because of size=2), one for the second stage, plus one for the main program.
The MPipe cookbook offers some explanation of how processes are shut down internally using None as the last task.
To run the code, install MPipe:
virtualenv venv
venv/bin/pip install mpipe
venv/bin/python prog.py
Output:
2
4
6
8
10
For Idea 1, how about:
    import multiprocessing as mp

    sentinel = None

    def f2(inq, outq):
        # Worker: double each value until the sentinel arrives.
        while True:
            val = inq.get()
            if val is sentinel:
                break
            outq.put(val * 2)

    def f3(outq):
        # Printer: consume results until the sentinel arrives.
        while True:
            val = outq.get()
            if val is sentinel:
                break
            print(val)

    def f1():
        num_workers = 2
        inq = mp.Queue()
        outq = mp.Queue()
        for i in range(5):
            inq.put(i)
        for i in range(num_workers):
            inq.put(sentinel)           # one sentinel per worker
        workers = [mp.Process(target=f2, args=(inq, outq))
                   for i in range(num_workers)]
        printer = mp.Process(target=f3, args=(outq,))
        for w in workers:
            w.start()
        printer.start()
        for w in workers:
            w.join()                    # wait until all workers are done ...
        outq.put(sentinel)              # ... then tell f3 to stop
        printer.join()

    if __name__ == '__main__':
        f1()
The only difference from the description of Idea 1 is that f2 breaks out of the while-loop when it receives the sentinel (thus terminating itself). f1 blocks until the workers are done (using w.join()) and then sends f3 the sentinel (signaling that it should break out of its while-loop).