 

Multiprocessing in Python while limiting the number of running processes

I'd like to run multiple instances of program.py simultaneously, while limiting the number of instances running at the same time (e.g. to the number of CPU cores available on my system). For example, if I have 10 cores and have to do 1000 runs of program.py in total, only 10 instances will be created and running at any given time.

I've tried using the multiprocessing module, multithreading, and using queues, but there's nothing that seemed to me to lend itself to an easy implementation. The biggest problem I have is finding a way to limit the number of processes running simultaneously. This is important because if I create 1000 processes at once, it becomes equivalent to a fork bomb. I don't need the results returned from the processes programmatically (they output to disk), and the processes all run independently of each other.

Can anyone please give me suggestions or an example of how I could implement this in python, or even bash? I'd post the code I've written so far using queues, but it doesn't work as intended and might already be down the wrong path.

Many thanks.

asked Aug 16 '12 by steadfast


People also ask

When should I use multiprocessing in Python?

With threads, CPU time gets rationed out between the threads of a single process. Multiprocessing is for times when you really do want more than one thing done at any given time, for example when your application needs to connect to 6 databases and perform a complex matrix transformation on each dataset.

How lock in multiprocessing Python?

Python provides a mutual exclusion lock for use with processes via the multiprocessing.Lock class. An instance of the lock can be created, then acquired by a process before it enters a critical section and released after it leaves the critical section. Only one process can hold the lock at any time.
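As a rough sketch of that pattern (the shared output file and worker function here are made up for illustration, not taken from the question):

from multiprocessing import Process, Lock

def append_line(lock, path, text):
    # only one process at a time may hold the lock and write to the file
    with lock:
        with open(path, "a") as f:
            f.write(text + "\n")

if __name__ == "__main__":
    lock = Lock()
    procs = [Process(target=append_line, args=(lock, "out.txt", "worker %d" % i))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()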

Does multiprocessing make Python faster?

Multiprocessing in Python enables the computer to utilize multiple cores of a CPU to run tasks/processes in parallel. This parallelization leads to significant speedup in tasks that involve a lot of computation.

How does multiprocessing process work in Python?

The multiprocessing package supports spawning processes: a Process object loads and executes a target function in a new child process. To wait for a child to terminate, or to let it continue computing concurrently, the parent process uses an API similar to that of the threading module.
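A minimal sketch of that spawn-and-wait flow (the work function is made up for illustration):

from multiprocessing import Process
import os

def work(n):
    # runs in its own child process, with its own PID
    print("task %d running in pid %d" % (n, os.getpid()))

if __name__ == "__main__":
    children = [Process(target=work, args=(i,)) for i in range(3)]
    for p in children:
        p.start()   # spawn the child process
    for p in children:
        p.join()    # wait for it to finish, much like threading.Thread.join()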




2 Answers

I know you mentioned that the Pool.map approach doesn't make much sense to you. map is just an easy way to give the pool a source of work and a callable to apply to each item. The func passed to map can be any entry point that does the actual work on the given arg.

If that doesn't seem right for you, I have a pretty detailed answer over here about using a Producer-Consumer pattern: https://stackoverflow.com/a/11196615/496445

Essentially, you create a Queue and start N workers. Then you either feed the queue from the main thread, or create a Producer process that feeds the queue. The workers keep taking work from the queue, and there will never be more concurrent work happening than the number of processes you started.

You also have the option of putting a size limit on the queue, so that it blocks the producer when there is already too much outstanding work, in case you also need to constrain the speed and resources that the producer consumes.

The work function that gets called can do anything you want. It can be a wrapper around some system command, or it can import your Python lib and run the main routine. There are dedicated process management systems out there which let you set up configs to run arbitrary executables under limited resources, but this is just a basic Python approach to doing it.
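For the question as asked, where each unit of work is just one run of program.py, the work function can be a thin wrapper around a subprocess call. A rough sketch, assuming program.py takes a run number as its only argument (adjust the command line to whatever your script actually expects):

import subprocess
from multiprocessing import Pool, cpu_count

def run_program(run_id):
    # each pool worker blocks here until its copy of program.py exits,
    # so at most cpu_count() copies are ever running at once
    return subprocess.call(["python", "program.py", str(run_id)])

if __name__ == "__main__":
    pool = Pool(cpu_count())                      # one worker per core
    exit_codes = pool.map(run_program, range(1000))
    pool.close()
    pool.join()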

Snippets from that other answer of mine:

Basic Pool:

from multiprocessing import Pool

def do_work(val):
    # could instantiate some other library class,
    # call out to the file system,
    # or do something simple right here.
    return "FOO: %s" % val

pool = Pool(4)
work = get_work_args()
results = pool.map(do_work, work)
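Here Pool(4) is what enforces the limit: only four worker processes exist, so only four do_work calls run at any moment, no matter how long the work list is. get_work_args() is just a placeholder for however you build your list of arguments.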

Using a process manager and producer:

from multiprocessing import Process, Manager
import time
import itertools

def do_work(in_queue, out_list):
    while True:
        item = in_queue.get()

        # exit signal
        if item is None:
            return

        # fake work
        time.sleep(.5)
        result = item

        out_list.append(result)


if __name__ == "__main__":
    num_workers = 4

    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)

    # start the workers
    pool = []
    for i in xrange(num_workers):
        p = Process(target=do_work, args=(work, results))
        p.start()
        pool.append(p)

    # produce data
    # this could also be started in a producer process
    # instead of blocking
    iters = itertools.chain(get_work_args(), (None,)*num_workers)
    for item in iters:
        work.put(item)

    for p in pool:
        p.join()

    print results
answered by jdi


You should use a process supervisor. One approach would be to use the API provided by Circus to do this programmatically; the documentation site is offline at the moment, but I think that's just a temporary problem. Either way, you can use Circus to handle this. Another approach would be to use supervisord, setting its numprocs parameter for the program to the number of cores you have.

An example using Circus:

from circus import get_arbiter

arbiter = get_arbiter("myprogram", numprocesses=3)
try:
    arbiter.start()
finally:
    arbiter.stop()
answered by Tarantula