I have very simple cases where the work to be done can be broken up and distributed among workers. I tried a very simple multiprocessing example from here:
import multiprocessing
import numpy as np
import time
def do_calculation(data):
rand=np.random.randint(10)
print data, rand
time.sleep(rand)
return data * 2
if __name__ == '__main__':
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size)
inputs = list(range(10))
print 'Input :', inputs
pool_outputs = pool.map(do_calculation, inputs)
print 'Pool :', pool_outputs
The above program produces the following output :
Input : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0 7
1 7
2 7
5 7
3 7
4 7
6 7
7 7
8 6
9 6
Pool : [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
Why is the same random number getting printed? (I have 4 cpus in my machine). Is this the best/simplest way to go ahead?
Synchronization between processes Multiprocessing is a package which supports spawning processes using an API. This package is used for both local and remote concurrencies. Using this module, programmer can use multiple processors on a given machine. It runs on Windows and UNIX os.
Multiprocessing refers to the ability of a system to support more than one processor at the same time. Applications in a multiprocessing system are broken to smaller routines that run independently. The operating system allocates these threads to the processors improving performance of the system.
As we have seen, the Process allocates all the tasks in memory and Pool allocates only executing processes in memory, so when the task numbers is large, we can use Pool and when the task number is small, we can use Process class.
multiprocessing. dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module. That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs.
I think you'll need to re-seed the random number generator using numpy.random.seed in your do_calculation
function.
My guess is that the random number generator (RNG) gets seeded when you import the module. Then, when you use multiprocessing, you fork the current process with the RNG already seeded -- Thus, all your processes are sharing the same seed value for the RNG and so they'll generate the same sequences of numbers.
e.g.:
def do_calculation(data):
np.random.seed()
rand=np.random.randint(10)
print data, rand
return data * 2
This blog post provides an example of a good and bad practise when using numpy.random and multi-processing. The more important is to understand when the seed of your pseudo random number generator (PRNG) is created:
import numpy as np
import pprint
from multiprocessing import Pool
pp = pprint.PrettyPrinter()
def bad_practice(index):
return np.random.randint(0,10,size=10)
def good_practice(index):
return np.random.RandomState().randint(0,10,size=10)
p = Pool(5)
pp.pprint("Bad practice: ")
pp.pprint(p.map(bad_practice, range(5)))
pp.pprint("Good practice: ")
pp.pprint(p.map(good_practice, range(5)))
output:
'Bad practice: '
[array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9]),
array([4, 2, 8, 0, 1, 1, 6, 1, 2, 9])]
'Good practice: '
[array([8, 9, 4, 5, 1, 0, 8, 1, 5, 4]),
array([5, 1, 3, 3, 3, 0, 0, 1, 0, 8]),
array([1, 9, 9, 9, 2, 9, 4, 3, 2, 1]),
array([4, 3, 6, 2, 6, 1, 2, 9, 5, 2]),
array([6, 3, 5, 9, 7, 1, 7, 4, 8, 5])]
In the good practice the seed is created once per thread while in the bad practise the seed is created only once when you import the numpy.random module.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With