I'm having a problem where I'm getting different random numbers across different computers despite
scipy.__version__ == '1.2.1'
on all computersnumpy.__version__ == '1.15.4'
on all computersrandom_state
seed is fixed to the same number (42) in every function call that generates random numbers for reproducible resultsThe code is a bit to complex to post in full here, but I noticed results start to diverge specifically when sampling from a multivariate normal:
import numpy as np
from scipy import stats
seed = 42
n_sim = 1000000
d = corr_mat.shape[0] # corr_mat is a 15x15 correlation matrix, numpy.ndarray
# results diverge from here across different hardware
z = stats.multivariate_normal(mean=np.zeros(d), cov=corr_mat).rvs(n_sim, random_state=seed)
corr_mat
is a correlation matrix (see Appendix below) and is the same across all computers.
The two different computers we are testing on are
corr_mat
>>> array([[1. , 0.15, 0.25, 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 , 0.25, 0.25,
0.25, 0.1 , 0.1 , 0.1 ],
[0.15, 1. , 0. , 0. , 0. , 0. , 0.15, 0.05, 0.15, 0.15, 0.15,
0. , 0.15, 0.15, 0.15],
[0.25, 0. , 1. , 0.25, 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.25, 0. , 0.25, 1. , 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.25, 0. , 0.25, 0.25, 1. , 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.25, 0. , 0.25, 0.25, 0.25, 1. , 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 1. , 0.15, 0.25, 0.25, 0.25,
0.2 , 0.25, 0.25, 0.25],
[0.1 , 0.05, 0. , 0. , 0. , 0. , 0.15, 1. , 0.15, 0.15, 0.15,
0. , 0.15, 0.15, 0.15],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 1. , 0.25, 0.25,
0.2 , 0.25, 0.25, 0.25],
[0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 1. , 0.25,
0.2 , 0.25, 0.25, 0.25],
[0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 1. ,
0.2 , 0.25, 0.25, 0.25],
[0.25, 0. , 0.25, 0.25, 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
1. , 0.2 , 0.2 , 0.2 ],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
0.2 , 1. , 0.25, 0.25],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
0.2 , 0.25, 1. , 0.25],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
0.2 , 0.25, 0.25, 1. ]])
This module contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.
stats module is used to generate a random sample of any size from poisson distribution. The norm. rvs() method from the scipy. stats module can be used to generate a random sample of any size from Normal Distribution.
Random variates of given type. The shape parameter(s) for the distribution (see docstring of the instance object for more information).
The function skewtest can be used to determine if the skewness value is close enough to zero, statistically speaking.
The following is an educated guess which I cannot validate since I don't have multiple machines.
Sampling from a correlated multinormal is typically done by sampling from an uncorrelated standard normal and then multiplying with a "square root" of the covariance matrix. I get a fairly similar sample to the one scipy produces with seed set at 42 and your covariance matrix if I use instead identity(15)
for the covariance and then multiply with l*sqrt(d)
where l,d,r = np.linalg.svd(covariance)
SVD is I suppose complex enough to explain small differences between platforms.
How can this snowball into something significant?
I think your choice of covariance matrix is to blame, since it has nonunique eigenvalues. As a consequence SVD is not unique, since eigenspaces to a given multiple eigenvalue can be rotated. This has the potential to hugely amplify a small numerical difference.
It would be interesting to see whether the differences you see persist if you test with a different covariance matrix with unique eigenvalues.
Edit:
For reference, here is what i tried for your smaller (6D) example:
>>> cm6 = np.array([[1,.5,.15,.15,0,0], [.5,1,.15,.15,0,0],[.15,.15,1,.25,0,0],[.15,.15,.25,1,0,0],[0,0,0,0,1,.1],[0,0,0,0,.1,1]])
>>> ls6,ds6,rs6 = np.linalg.svd(cm6)
>>> np.random.seed(42)
>>> cs6 = stats.multivariate_normal(cov=cm6).rvs()
>>> np.random.seed(42)
>>> is6 = stats.multivariate_normal(cov=np.identity(6)).rvs()
>>> LS6 = ls6*np.sqrt(ds6)
>>> np.allclose(cs6, LS6@is6)
True
As you report that the problem persists with unique eigenvalues here is one more possibility. Above I have used svd
to compute eigen vectors / values which is ok since cov is symmetric. What happens if we use eigh
instead?
>>> de6,le6 = np.linalg.eigh(cm6)
>>> LE6 = le6*np.sqrt(de6)
>>> cs6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -0.91913994,
1.12421904])
>>> LE6@is6
array([ 0.54338614, 1.04010029, -0.71379193, -0.88313042, -0.60813547,
0.26082989])
These are different. Why? First, eigh
orders the eigenspaces the other way round:
>>> ds6
array([1.7 , 1.1 , 1.05, 0.9 , 0.75, 0.5 ])
>>> de6
array([0.5 , 0.75, 0.9 , 1.05, 1.1 , 1.7 ])
Does that fix it? Almost.
>>> LE6[:, ::-1]@is6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -1.12421904,
0.91913994])
We see that the last two samples are swapped and their signs flipped. Turns out this is due to the sign of one eigen vector being inverted.
So even for unique eigen values we can get large differences because of ambiguities in (1) the order of eigen spaces and (2) the sign of eigen vectors.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With