I am running a Monte Carlo simulation in parallel using joblib. I noticed that although my seeds were fixed, my results kept changing. When I ran the same process in series, however, the results stayed constant, as I expected.
Below is a small example that simulates the mean of a normal distribution.
Load libraries and define the function:
    import numpy as np
    from joblib import Parallel, delayed

    def _estimate_mean():
        np.random.seed(0)
        x = np.random.normal(0, 2, size=100)
        return np.mean(x)
The first example runs in series; the results are all identical, as expected.
    tst = [_estimate_mean() for i in range(8)]
In [28]: tst
Out[28]:
[0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897]
The second example runs in parallel. (Note: sometimes the means are all the same, other times not.)
    tst = Parallel(n_jobs=-1, backend="threading")(delayed(_estimate_mean)() for i in range(8))
In [26]: tst
Out[26]:
[0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.1640259414956747,
-0.11846452111932627,
-0.3935934130918206]
I expect the parallel run to give the same results, since the seed is fixed. I found that using RandomState to fix the seed seems to resolve the problem:
    def _estimate_mean():
        local_state = np.random.RandomState(0)
        x = local_state.normal(0, 2, size=100)
        return np.mean(x)

    tst = Parallel(n_jobs=-1, backend="threading")(delayed(_estimate_mean)() for i in range(8))
In [28]: tst
Out[28]:
[0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897,
0.11961603106897]
What is the difference between using RandomState and just seed when fixing the seeds with numpy.random, and why does the latter not work reliably when running in parallel?
System Information
OS: Windows 10
Python: 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]
Numpy: 1.17.2
The result you're getting with numpy.random.* is caused by a race condition. numpy.random.* uses a single global PRNG that is shared across all threads without synchronization. Since the threads run in parallel and their access to this global PRNG is not synchronized, they all race to read and update the PRNG state, so the state can change behind another thread's back. Giving each thread its own PRNG (RandomState) solves the problem because there is no longer any state shared by multiple threads without synchronization.
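The per-thread fix can be sketched as follows. This is a minimal example that substitutes the standard library's ThreadPoolExecutor for joblib's threading backend, and (as a variation on your code, which reused seed 0 everywhere) gives each worker a distinct seed so the streams are both reproducible and independent; the helper name estimate_mean is illustrative:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def estimate_mean(seed):
    # Each call builds a private PRNG from its own seed, so no
    # thread ever touches another thread's generator state.
    rng = np.random.RandomState(seed)
    x = rng.normal(0, 2, size=100)
    return np.mean(x)

# ThreadPoolExecutor stands in for joblib's threading backend here.
with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(estimate_mean, range(8)))
```

Re-running this gives the same eight means every time, because each worker's stream depends only on its seed, not on thread scheduling.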
Since you're using NumPy 1.17, you should know there is a better alternative: NumPy 1.17 introduced a new random number generation system. It uses so-called bit generators, such as PCG64, and random generators, such as the new numpy.random.Generator. It was the result of a proposal to change the RNG policy, which states that the numpy.random.* functions should generally not be used anymore, in large part because numpy.random.* operates on global state.
The NumPy documentation now has detailed information on the new RNG system. See also "Seed Generation for Noncryptographic PRNGs", an article of mine with general advice on RNG selection.
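With the new system, a clean way to get reproducible, independent streams per worker is SeedSequence.spawn. The sketch below keeps everything serial for brevity (the loop stands in for the Parallel call), and the helper name estimate_mean is illustrative:

```python
import numpy as np

def estimate_mean(rng):
    # Draw from the Generator passed in; no global state involved.
    x = rng.normal(0, 2, size=100)
    return np.mean(x)

# One root SeedSequence; spawn() derives statistically independent
# child seeds, one per worker, all reproducible from the root seed.
root = np.random.SeedSequence(0)
streams = [np.random.default_rng(child) for child in root.spawn(8)]
results = [estimate_mean(rng) for rng in streams]
```

Because each worker owns its Generator, this stays correct under any backend (threading or processes), and rebuilding the streams from the same root seed reproduces the results exactly.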