Same output in different workers in multiprocessing

I have very simple cases where the work to be done can be broken up and distributed among workers. I tried a very simple multiprocessing example from here:

import multiprocessing
import numpy as np
import time

def do_calculation(data):
    rand = np.random.randint(10)
    print(data, rand)
    time.sleep(rand)
    return data * 2

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size)

    inputs = list(range(10))
    print('Input   :', inputs)

    pool_outputs = pool.map(do_calculation, inputs)
    print('Pool    :', pool_outputs)

The above program produces the following output:

Input   : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0 7
1 7
2 7
5 7
3 7
4 7
6 7
7 7
8 6
9 6
Pool    : [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Why is the same random number printed in different workers? (My machine has 4 CPUs.) Is this the best/simplest way to go about it?

asked Oct 16 '12 12:10 by imsc



2 Answers

I think you'll need to re-seed the random number generator using numpy.random.seed in your do_calculation function.

My guess is that the random number generator (RNG) gets seeded when you import the module. Then, when you use multiprocessing, the current process is forked with the RNG already seeded, so every worker starts from a copy of the same RNG state and generates the same sequence of numbers.
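
The effect of starting from a copied RNG state can be reproduced in a single process. This is only a sketch of the mechanism, not of what multiprocessing does internally: the helper name draw_from_snapshot is made up for illustration, and get_state/set_state stand in for the memory copy that a fork performs.

```python
import numpy as np


def draw_from_snapshot(state, n=5):
    # Hypothetical helper: restore a saved global RNG state, then draw
    # n integers from it. This mimics a worker that starts with a copy
    # of the parent's RNG state.
    np.random.set_state(state)
    return np.random.randint(10, size=n)


# Snapshot of the global RNG, standing in for the state a fork would copy.
snapshot = np.random.get_state()

worker_a = draw_from_snapshot(snapshot)
worker_b = draw_from_snapshot(snapshot)

# Both "workers" started from the same state, so the draws match.
print(worker_a, worker_b, (worker_a == worker_b).all())
```

Both arrays come out identical, which is the same pattern as the repeated 7s in the question's output.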

For example, reseeding inside the worker function:

def do_calculation(data):
    np.random.seed()  # reseed from fresh OS entropy in this worker
    rand = np.random.randint(10)
    print(data, rand)
    return data * 2
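
A possible variation, sketched here rather than taken from the answer: reseed once per worker process instead of once per task, using the initializer argument of multiprocessing.Pool. The function name reseed_worker and the pool size of 4 are assumptions chosen for illustration.

```python
import multiprocessing
import os

import numpy as np


def reseed_worker():
    # Runs once in each worker right after it starts. Calling seed()
    # with no argument pulls fresh entropy from the OS, so this worker's
    # copy of the global RNG diverges from the state it inherited.
    np.random.seed()


def do_calculation(data):
    rand = np.random.randint(10)
    # Return the worker's PID too, so the output shows which process drew what.
    return os.getpid(), data, rand


if __name__ == '__main__':
    # Pool size of 4 is an arbitrary choice for this sketch.
    with multiprocessing.Pool(processes=4, initializer=reseed_worker) as pool:
        for pid, data, rand in pool.map(do_calculation, range(10)):
            print(pid, data, rand)
```

This keeps the worker function itself free of seeding logic, at the cost of relying on the pool being created with the initializer.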

answered Oct 02 '22 17:10 by mgilson


This blog post provides an example of good and bad practice when using numpy.random with multiprocessing. The most important thing is to understand when the seed of your pseudo-random number generator (PRNG) is created:

import numpy as np
import pprint
from multiprocessing import Pool

pp = pprint.PrettyPrinter()

def bad_practice(index):
    # Uses the global RNG, whose state every forked worker inherits.
    return np.random.randint(0, 10, size=10)

def good_practice(index):
    # Builds a fresh RandomState, seeded from OS entropy, on every call.
    return np.random.RandomState().randint(0, 10, size=10)

if __name__ == '__main__':
    p = Pool(5)

    pp.pprint("Bad practice: ")
    pp.pprint(p.map(bad_practice, range(5)))
    pp.pprint("Good practice: ")
    pp.pprint(p.map(good_practice, range(5)))

output:

'Bad practice: '
[array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
 array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9])]
'Good practice: '
[array([8, 9, 4, 5, 1, 0, 8, 1, 5, 4]),
 array([5, 1, 3, 3, 3, 0, 0, 1, 0, 8]),
 array([1, 9, 9, 9, 2, 9, 4, 3, 2, 1]),
 array([4, 3, 6, 2, 6, 1, 2, 9, 5, 2]),
 array([6, 3, 5, 9, 7, 1, 7, 4, 8, 5])]

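In the good practice, a new RandomState is created, and therefore freshly seeded, on every call inside the worker process. In the bad practice, the global generator is seeded only once, when the numpy.random module is imported, and every forked worker inherits a copy of that same state.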
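
If the results also need to be reproducible, one option in newer NumPy versions is to derive one child seed per task from a single root seed and build a separate generator from each. This is a sketch under assumptions, not part of the blog post: the names draw and make_child_seeds and the root seed 12345 are invented for illustration, and it relies on np.random.SeedSequence and np.random.default_rng being available (NumPy 1.17 or later).

```python
from multiprocessing import Pool

import numpy as np


def draw(child_seed):
    # Each task builds its own Generator from the seed it was handed,
    # so nothing depends on global RNG state inherited from the parent.
    rng = np.random.default_rng(child_seed)
    return rng.integers(0, 10, size=10)


def make_child_seeds(root_seed, n):
    # SeedSequence.spawn derives n child seeds from one root seed.
    return np.random.SeedSequence(root_seed).spawn(n)


if __name__ == '__main__':
    seeds = make_child_seeds(12345, 5)  # 12345 is an arbitrary root seed
    with Pool(5) as p:
        for row in p.map(draw, seeds):
            print(row)
```

Because every stream is derived from the root seed, rerunning the script with the same root seed should give the same rows, while the rows within one run come from separate streams.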

answered Oct 02 '22 19:10 by t_sic