I'm having a problem where I'm getting different random numbers across different computers despite <ul> <li> <code>scipy.__version__ == '1.2.1'</code> on all computers</li> <li> <code>numpy.__version__ == '1.15.4'</code> on all computers</li> <li> <code>random_state</code> seed is fixed to the same number (42) in every function call that generates random numbers for reproducible results</li> </ul> The code is a bit to complex to post in full here, but I noticed results start to diverge specifically when sampling from a multivariate normal: <pre class="prettyprint"><code>import numpy as np from scipy import stats seed = 42 n_sim = 1000000 d = corr_mat.shape[0] # corr_mat is a 15x15 correlation matrix, numpy.ndarray # results diverge from here across different hardware z = stats.multivariate_normal(mean=np.zeros(d), cov=corr_mat).rvs(n_sim, random_state=seed) </code></pre> <code>corr_mat</code> is a correlation matrix (see Appendix below) and is the same across all computers. The two different computers we are testing on are <h3>Computer 1</h3> <hr> <ul> <li>OS: Windows 7</li> <li>Processor: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60Ghz 2.60 Ghz (2 processors)</li> <li>RAM: 64 GB</li> <li>System type: 64-bit</li> </ul> <hr> <h3>Computer 2</h3> <hr> <ul> <li>OS: Windows 7</li> <li>Processor: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.10Ghz 2.10 Ghz (2 processors)</li> <li>RAM: 64 GB</li> <li>System type: 64-bit</li> </ul> <hr> <h3>Appendix</h3> <pre class="prettyprint"><code>corr_mat >>> array([[1. , 0.15, 0.25, 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 , 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 ], [0.15, 1. , 0. , 0. , 0. , 0. , 0.15, 0.05, 0.15, 0.15, 0.15, 0. , 0.15, 0.15, 0.15], [0.25, 0. , 1. , 0.25, 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 , 0.25, 0.2 , 0.2 , 0.2 ], [0.25, 0. , 0.25, 1. , 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 , 0.25, 0.2 , 0.2 , 0.2 ], [0.25, 0. , 0.25, 0.25, 1. , 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 , 0.25, 0.2 , 0.2 , 0.2 ], [0.25, 0. , 0.25, 0.25, 0.25, 1. , 0.2 , 0. , 0.2 , 0.2 , 0.2 , 0.25, 0.2 , 0.2 , 0.2 ], [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 1. , 0.15, 0.25, 0.25, 0.25, 0.2 , 0.25, 0.25, 0.25], [0.1 , 0.05, 0. , 0. , 0. , 0. , 0.15, 1. , 0.15, 0.15, 0.15, 0. , 0.15, 0.15, 0.15], [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 1. , 0.25, 0.25, 0.2 , 0.25, 0.25, 0.25], [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 1. , 0.25, 0.2 , 0.25, 0.25, 0.25], [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 1. , 0.2 , 0.25, 0.25, 0.25], [0.25, 0. , 0.25, 0.25, 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 , 1. , 0.2 , 0.2 , 0.2 ], [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25, 0.2 , 1. , 0.25, 0.25], [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25, 0.2 , 0.25, 1. , 0.25], [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25, 0.2 , 0.25, 0.25, 1. ]]) </code></pre>

The following is an educated guess which I cannot validate since I don't have multiple machines. Sampling from a correlated multinormal is typically done by sampling from an uncorrelated standard normal and then multiplying with a "square root" of the covariance matrix. I get a fairly similar sample to the one scipy produces with seed set at 42 and your covariance matrix if I use instead <code>identity(15)</code> for the covariance and then multiply with <code>l*sqrt(d)</code> where <code>l,d,r = np.linalg.svd(covariance)</code> SVD is I suppose complex enough to explain small differences between platforms. How can this snowball into something significant? I think your choice of covariance matrix is to blame, since it has nonunique eigenvalues. As a consequence SVD is not unique, since eigenspaces to a given multiple eigenvalue can be rotated. This has the potential to hugely amplify a small numerical difference. It would be interesting to see whether the differences you see persist if you test with a different covariance matrix with unique eigenvalues. Edit: For reference, here is what i tried for your smaller (6D) example: <pre class="prettyprint"><code>>>> cm6 = np.array([[1,.5,.15,.15,0,0], [.5,1,.15,.15,0,0],[.15,.15,1,.25,0,0],[.15,.15,.25,1,0,0],[0,0,0,0,1,.1],[0,0,0,0,.1,1]]) >>> ls6,ds6,rs6 = np.linalg.svd(cm6) >>> np.random.seed(42) >>> cs6 = stats.multivariate_normal(cov=cm6).rvs() >>> np.random.seed(42) >>> is6 = stats.multivariate_normal(cov=np.identity(6)).rvs() >>> LS6 = ls6*np.sqrt(ds6) >>> np.allclose(cs6, LS6@is6) True </code></pre> As you report that the problem persists with unique eigenvalues here is one more possibility. Above I have used <code>svd</code> to compute eigen vectors / values which is ok since cov is symmetric. What happens if we use <code>eigh</code> instead? <pre class="prettyprint"><code>>>> de6,le6 = np.linalg.eigh(cm6) >>> LE6 = le6*np.sqrt(de6) >>> cs6 array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -0.91913994, 1.12421904]) >>> LE6@is6 array([ 0.54338614, 1.04010029, -0.71379193, -0.88313042, -0.60813547, 0.26082989]) </code></pre> These are different. Why? First, <code>eigh</code> orders the eigenspaces the other way round: <pre class="prettyprint"><code>>>> ds6 array([1.7 , 1.1 , 1.05, 0.9 , 0.75, 0.5 ]) >>> de6 array([0.5 , 0.75, 0.9 , 1.05, 1.1 , 1.7 ]) </code></pre> Does that fix it? Almost. <pre class="prettyprint"><code>>>> LE6[:, ::-1]@is6 array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -1.12421904, 0.91913994]) </code></pre> We see that the last two samples are swapped and their signs flipped. Turns out this is due to the sign of one eigen vector being inverted. So even for unique eigen values we can get large differences because of ambiguities in (1) the order of eigen spaces and (2) the sign of eigen vectors.

Does scipy.stats produce different random numbers for different computer hardware?

Tags:

I'm having a problem where I'm getting different random numbers across different computers despite

scipy.__version__ == '1.2.1' on all computers
numpy.__version__ == '1.15.4' on all computers
random_state seed is fixed to the same number (42) in every function call that generates random numbers for reproducible results

The code is a bit to complex to post in full here, but I noticed results start to diverge specifically when sampling from a multivariate normal:

import numpy as np
from scipy import stats
seed = 42
n_sim = 1000000
d = corr_mat.shape[0] # corr_mat is a 15x15 correlation matrix, numpy.ndarray
# results diverge from here across different hardware
z = stats.multivariate_normal(mean=np.zeros(d), cov=corr_mat).rvs(n_sim, random_state=seed)

corr_mat is a correlation matrix (see Appendix below) and is the same across all computers.

The two different computers we are testing on are

Computer 1

OS: Windows 7
Processor: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60Ghz 2.60 Ghz (2 processors)
RAM: 64 GB
System type: 64-bit

Computer 2

OS: Windows 7
Processor: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.10Ghz 2.10 Ghz (2 processors)
RAM: 64 GB
System type: 64-bit

Appendix

corr_mat
>>> array([[1.  , 0.15, 0.25, 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 , 0.25, 0.25,
        0.25, 0.1 , 0.1 , 0.1 ],
       [0.15, 1.  , 0.  , 0.  , 0.  , 0.  , 0.15, 0.05, 0.15, 0.15, 0.15,
        0.  , 0.15, 0.15, 0.15],
       [0.25, 0.  , 1.  , 0.25, 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 1.  , 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 0.25, 1.  , 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.25, 0.  , 0.25, 0.25, 0.25, 1.  , 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        0.25, 0.2 , 0.2 , 0.2 ],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 1.  , 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.1 , 0.05, 0.  , 0.  , 0.  , 0.  , 0.15, 1.  , 0.15, 0.15, 0.15,
        0.  , 0.15, 0.15, 0.15],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 1.  , 0.25, 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 1.  , 0.25,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 1.  ,
        0.2 , 0.25, 0.25, 0.25],
       [0.25, 0.  , 0.25, 0.25, 0.25, 0.25, 0.2 , 0.  , 0.2 , 0.2 , 0.2 ,
        1.  , 0.2 , 0.2 , 0.2 ],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 1.  , 0.25, 0.25],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 1.  , 0.25],
       [0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
        0.2 , 0.25, 0.25, 1.  ]])

911

asked Mar 04 '19 13:03

PyRsquared

1 Answers

The following is an educated guess which I cannot validate since I don't have multiple machines.

Sampling from a correlated multinormal is typically done by sampling from an uncorrelated standard normal and then multiplying with a "square root" of the covariance matrix. I get a fairly similar sample to the one scipy produces with seed set at 42 and your covariance matrix if I use instead identity(15) for the covariance and then multiply with l*sqrt(d) where l,d,r = np.linalg.svd(covariance)

SVD is I suppose complex enough to explain small differences between platforms.

How can this snowball into something significant?

I think your choice of covariance matrix is to blame, since it has nonunique eigenvalues. As a consequence SVD is not unique, since eigenspaces to a given multiple eigenvalue can be rotated. This has the potential to hugely amplify a small numerical difference.

It would be interesting to see whether the differences you see persist if you test with a different covariance matrix with unique eigenvalues.

Edit:

For reference, here is what i tried for your smaller (6D) example:

>>> cm6 = np.array([[1,.5,.15,.15,0,0], [.5,1,.15,.15,0,0],[.15,.15,1,.25,0,0],[.15,.15,.25,1,0,0],[0,0,0,0,1,.1],[0,0,0,0,.1,1]])
>>> ls6,ds6,rs6 = np.linalg.svd(cm6)
>>> np.random.seed(42)
>>> cs6 = stats.multivariate_normal(cov=cm6).rvs()
>>> np.random.seed(42)
>>> is6 = stats.multivariate_normal(cov=np.identity(6)).rvs()
>>> LS6 = ls6*np.sqrt(ds6)
>>> np.allclose(cs6, LS6@is6)
True

As you report that the problem persists with unique eigenvalues here is one more possibility. Above I have used svd to compute eigen vectors / values which is ok since cov is symmetric. What happens if we use eigh instead?

>>> de6,le6 = np.linalg.eigh(cm6)
>>> LE6 = le6*np.sqrt(de6)
>>> cs6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -0.91913994,
        1.12421904])
>>> LE6@is6
array([ 0.54338614,  1.04010029, -0.71379193, -0.88313042, -0.60813547,
        0.26082989])

These are different. Why? First, eigh orders the eigenspaces the other way round:

>>> ds6
array([1.7 , 1.1 , 1.05, 0.9 , 0.75, 0.5 ])
>>> de6
array([0.5 , 0.75, 0.9 , 1.05, 1.1 , 1.7 ])

Does that fix it? Almost.

>>> LE6[:, ::-1]@is6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -1.12421904,
        0.91913994])

We see that the last two samples are swapped and their signs flipped. Turns out this is due to the sign of one eigen vector being inverted.

So even for unique eigen values we can get large differences because of ambiguities in (1) the order of eigen spaces and (2) the sign of eigen vectors.

answered Sep 29 '22 04:09

Paul Panzer

Related questions
                            
                                waveforms arent as smooth as they should be
                            
                                Way to add network share location in Microsoft Teams
                            
                                Appium hidekeyboard() method
                            
                                I want to change horizontal scrollbar css
                            
                                Bot framework V4 Nodejs chat history logging
                            
                                How can I perform data lineage in GCP? [closed]
                            
                                Using Databricks Connect
                            
                                Detecting a broken shape using OpenCV
                            
                                Steps to reduce time delay due to GC allocation failure in azure databricks
                            
                                Date parsing issue with Arabic and regular date in Java
                            
                                Active-Passive Jenkins Setup in Kubernetes
                            
                                Keeping a method with single parameter using proguard

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With