Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does scipy.stats produce different random numbers for different computer hardware?

Tags:

I'm having a problem where I'm getting different random numbers across different computers despite

  • scipy.__version__ == '1.2.1' on all computers
  • numpy.__version__ == '1.15.4' on all computers
  • random_state seed is fixed to the same number (42) in every function call that generates random numbers for reproducible results

The code is a bit to complex to post in full here, but I noticed results start to diverge specifically when sampling from a multivariate normal:

import numpy as np
from scipy import stats
seed = 42
n_sim = 1000000
d = corr_mat.shape[0] # corr_mat is a 15x15 correlation matrix, numpy.ndarray
# results diverge from here across different hardware
z = stats.multivariate_normal(mean=np.zeros(d), cov=corr_mat).rvs(n_sim, random_state=seed)

corr_mat is a correlation matrix (see Appendix below) and is the same across all computers.

The two different computers we are testing on are

Computer 1


  • OS: Windows 7
  • Processor: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60Ghz 2.60 Ghz (2 processors)
  • RAM: 64 GB
  • System type: 64-bit

Computer 2


  • OS: Windows 7
  • Processor: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.10Ghz 2.10 Ghz (2 processors)
  • RAM: 64 GB
  • System type: 64-bit

Appendix

corr_mat
>>> array([[1.  , 0.15, 0.25, 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 , 0.25, 0.25,
        0.25, 0.1 , 0.1 , 0.1 ],
       [0.15, 1.  , 0.  , 0.  , 0.  , 0.  , 0.15, 0.05, 0.15, 0.15, 0.15,
        0.  , 0.15, 0.15, 0.15],
       [0.25, 0.  , 1.  , 0.25, 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 1.  , 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 0.25, 1.  , 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 0.25, 0.25, 1.  , 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 1.  , 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.1 , 0.05, 0.  , 0.  , 0.  , 0.  , 0.15, 1.  , 0.15, 0.15, 0.15,
        0.  , 0.15, 0.15, 0.15],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 1.  , 0.25, 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 1.  , 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 1.  ,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.  , 0.25, 0.25, 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        1.  , 0.2 , 0.2 , 0.2 ],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 1.  , 0.25, 0.25],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 1.  , 0.25],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 0.25, 1.  ]])
like image 911
PyRsquared Avatar asked Mar 04 '19 13:03

PyRsquared


People also ask

What is Scipy stats in Python?

This module contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.

Which of the following method is used to generate random numbers for a defined distribution available in Scipy stats module?

stats module is used to generate a random sample of any size from poisson distribution. The norm. rvs() method from the scipy. stats module can be used to generate a random sample of any size from Normal Distribution.

What is RVS in Scipy stats?

Random variates of given type. The shape parameter(s) for the distribution (see docstring of the instance object for more information).

Which of the following method of Scipy stats module is used to determine the skewness of a distribution?

The function skewtest can be used to determine if the skewness value is close enough to zero, statistically speaking.


1 Answers

The following is an educated guess which I cannot validate since I don't have multiple machines.

Sampling from a correlated multinormal is typically done by sampling from an uncorrelated standard normal and then multiplying with a "square root" of the covariance matrix. I get a fairly similar sample to the one scipy produces with seed set at 42 and your covariance matrix if I use instead identity(15) for the covariance and then multiply with l*sqrt(d) where l,d,r = np.linalg.svd(covariance)

SVD is I suppose complex enough to explain small differences between platforms.

How can this snowball into something significant?

I think your choice of covariance matrix is to blame, since it has nonunique eigenvalues. As a consequence SVD is not unique, since eigenspaces to a given multiple eigenvalue can be rotated. This has the potential to hugely amplify a small numerical difference.

It would be interesting to see whether the differences you see persist if you test with a different covariance matrix with unique eigenvalues.

Edit:

For reference, here is what i tried for your smaller (6D) example:

>>> cm6 = np.array([[1,.5,.15,.15,0,0], [.5,1,.15,.15,0,0],[.15,.15,1,.25,0,0],[.15,.15,.25,1,0,0],[0,0,0,0,1,.1],[0,0,0,0,.1,1]])
>>> ls6,ds6,rs6 = np.linalg.svd(cm6)
>>> np.random.seed(42)
>>> cs6 = stats.multivariate_normal(cov=cm6).rvs()
>>> np.random.seed(42)
>>> is6 = stats.multivariate_normal(cov=np.identity(6)).rvs()
>>> LS6 = ls6*np.sqrt(ds6)
>>> np.allclose(cs6, LS6@is6)
True

As you report that the problem persists with unique eigenvalues here is one more possibility. Above I have used svd to compute eigen vectors / values which is ok since cov is symmetric. What happens if we use eigh instead?

>>> de6,le6 = np.linalg.eigh(cm6)
>>> LE6 = le6*np.sqrt(de6)
>>> cs6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -0.91913994,
        1.12421904])
>>> LE6@is6
array([ 0.54338614,  1.04010029, -0.71379193, -0.88313042, -0.60813547,
        0.26082989])

These are different. Why? First, eigh orders the eigenspaces the other way round:

>>> ds6
array([1.7 , 1.1 , 1.05, 0.9 , 0.75, 0.5 ])
>>> de6
array([0.5 , 0.75, 0.9 , 1.05, 1.1 , 1.7 ])

Does that fix it? Almost.

>>> LE6[:, ::-1]@is6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -1.12421904,
        0.91913994])

We see that the last two samples are swapped and their signs flipped. Turns out this is due to the sign of one eigen vector being inverted.

So even for unique eigen values we can get large differences because of ambiguities in (1) the order of eigen spaces and (2) the sign of eigen vectors.

like image 95
Paul Panzer Avatar answered Sep 29 '22 04:09

Paul Panzer