Why is numpy random seed not remaining fixed but RandomState is when run in parallel?

I am running a Monte Carlo simulation in parallel using joblib. I noticed that, although my seeds were fixed, my results kept changing. When I ran the same process in series, however, the results stayed constant, as I expected.

Below is a small example that estimates the mean of a normal distribution.

Load Libraries and define function

import numpy as np
from joblib import Parallel, delayed

def _estimate_mean():
    np.random.seed(0)
    x = np.random.normal(0, 2, size=100)
    return np.mean(x)

In the first example I run the function in series - the results are all the same, as expected.

tst = [_estimate_mean() for i in range(8)]
In [28]: tst
Out[28]:
[0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897]

In the second example I run it in parallel. (Note that sometimes the means are all the same, other times not.)

tst = Parallel(n_jobs=-1, backend="threading")(delayed(_estimate_mean)() for i in range(8))

In [26]: tst
Out[26]:
[0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.1640259414956747,
 -0.11846452111932627,
 -0.3935934130918206]

I expected the parallel run to give the same results, since the seed is fixed. I found that using RandomState to fix the seed seems to resolve the problem:

def _estimate_mean():
    local_state = np.random.RandomState(0)
    x = local_state.normal(0, 2, size=100)
    return np.mean(x)

tst = Parallel(n_jobs=-1, backend="threading")(delayed(_estimate_mean)() for i in range(8))

In [28]: tst
Out[28]:
[0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897,
 0.11961603106897]

What is the difference between fixing the seed with RandomState and with numpy.random.seed, and why does the latter not work reliably when running in parallel?

System Information

OS: Windows 10

Python: 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]

Numpy: 1.17.2

Asked by RK1 on Nov 29 '19.

1 Answer

The result you're getting with numpy.random.* is caused by a race condition: numpy.random.* uses a single global PRNG that is shared across all threads without synchronization. Because the threads run concurrently, they race to access and advance the PRNG's state, so that state can change behind another thread's back. Giving each thread its own PRNG (a RandomState instance) solves the problem, because no state is shared between threads anymore.
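This interference doesn't even need real threads to show up. Here is a hypothetical sequential sketch of two callers sharing the global PRNG, which illustrates why the parallel results drift:

```python
import numpy as np

# "Thread A" seeds the global PRNG and draws from it.
np.random.seed(0)
a = np.random.normal(0, 2, size=100)

# "Thread B" reseeds the same global PRNG, then draws 50 numbers...
np.random.seed(0)
np.random.normal(0, 2, size=50)

# ...so when "thread A" repeats its draw, it gets different numbers,
# because the shared global state has moved on behind its back.
b = np.random.normal(0, 2, size=100)
```

In the threaded run, exactly this kind of interleaving happens nondeterministically, which is why some of the eight means match and others don't.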


Since you're using NumPy 1.17, you should know there is a better alternative: NumPy 1.17 introduced a new random number generation system, built on bit generators, such as PCG64, and on random generators, such as the new numpy.random.Generator.

It was the result of a proposal to change the RNG policy, which states that numpy.random.* functions should generally no longer be used, largely because numpy.random.* operates on global state.
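As a sketch of what that looks like for the question's function (assuming NumPy >= 1.17), numpy.random.default_rng replaces the legacy API; each call builds its own Generator, so nothing is shared. Note the drawn values will differ from the legacy RandomState(0) stream, since the underlying bit generator is different:

```python
import numpy as np

def _estimate_mean():
    # default_rng (NumPy >= 1.17) returns a Generator backed by its own
    # PCG64 bit generator, so no state is shared between callers.
    rng = np.random.default_rng(0)
    x = rng.normal(0, 2, size=100)
    return np.mean(x)
```

Because every call constructs an independent Generator from the same seed, repeated calls - in series or in parallel - return the same mean.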

The NumPy documentation now has detailed information on—

  • seeding RNGs in parallel, and
  • multithreading RNGs,

in the new RNG system. See also "Seed Generation for Noncryptographic PRNGs", an article of mine with general advice on RNG selection.
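The pattern those documentation pages recommend for parallel work is to spawn independent child seeds from a single SeedSequence, so each worker gets a statistically independent stream. A minimal sequential sketch (in the question's setup, each child seed would be passed to one joblib worker):

```python
import numpy as np

# One parent SeedSequence, spawned into eight independent children.
seed_seq = np.random.SeedSequence(0)
child_seeds = seed_seq.spawn(8)

def _estimate_mean(child_seed):
    # Each worker builds its own Generator from its own child seed.
    rng = np.random.default_rng(child_seed)
    return np.mean(rng.normal(0, 2, size=100))

# The streams are statistically independent, so unlike the question's
# example the eight means deliberately differ.
means = [_estimate_mean(s) for s in child_seeds]
```

The whole run is still reproducible - rerunning with SeedSequence(0) yields the same eight means - while avoiding both the shared-state race and any correlation between workers' streams.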

Answered by Peter O. on Oct 12 '22.