How to calculate np.cov on a matrix with np.nan values without converting to pd.DataFrame?

Tags:

python

python-3.x

pandas

numpy

covariance

I have the following np.array:

import numpy as np
my_matrix = np.array([[1,np.nan,3], [np.nan,1,2], [np.nan,1,2]])

array([[ 1., nan,  3.],
       [nan,  1.,  2.],
       [nan,  1.,  2.]])

If I evaluate np.cov on it, I get:

np.cov(my_matrix)

array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

But if I were to calculate it with pd.DataFrame.cov I get a different result:

import pandas as pd
pd.DataFrame(my_matrix).cov()

     0    1         2
0  NaN  NaN       NaN
1  NaN  0.0  0.000000
2  NaN  0.0  0.333333

I know from the pandas documentation that DataFrame.cov handles nan values.

My question is: how can I get the same (or a similar) result with numpy? More generally, how should I handle missing data when calculating covariance with numpy?

asked Dec 12 '18 19:12

Newskooler

1 Answer

You can use NumPy's masked arrays: mask the nan entries with ma.masked_invalid, then compute the covariance with ma.cov.

import numpy.ma as ma
cv = ma.cov(ma.masked_invalid(my_matrix), rowvar=False)
cv

masked_array(
  data=[[--, --, --],
        [--, 0.0, 0.0],
        [--, 0.0, 0.33333333333333337]],
  mask=[[ True,  True,  True],
        [ True, False, False],
        [ True, False, False]],
  fill_value=1e+20)

To get a plain ndarray with the masked entries replaced by nan, use the filled method.

cv.filled(np.nan)

array([[       nan,        nan,        nan],
       [       nan, 0.        , 0.        ],
       [       nan, 0.        , 0.33333333]])
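
If you want to sanity-check this result without pandas, one option is to spell out pairwise deletion by hand: for each pair of columns, use only the rows where both values are present. The sketch below assumes that this is the rule you want to match; `pairwise_nan_cov` is a hypothetical helper name, not a NumPy function. On the example matrix it gives the same values as the filled result above. On other inputs the two approaches are not guaranteed to agree exactly, since they may center the columns over different sets of rows, so compare them on your own data.

```python
import numpy as np


def pairwise_nan_cov(a, ddof=1):
    """Column covariances using, for each pair of columns,
    only the rows where both entries are non-NaN.

    Hypothetical helper for illustration; not part of NumPy.
    """
    a = np.asarray(a, dtype=float)
    n_cols = a.shape[1]
    valid = ~np.isnan(a)
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = valid[:, i] & valid[:, j]  # rows observed in both columns
            n = both.sum()
            if n - ddof <= 0:
                continue  # too few shared observations: leave NaN
            xi = a[both, i]
            xj = a[both, j]
            c = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (n - ddof)
            out[i, j] = out[j, i] = c
    return out


my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])
result = pairwise_nan_cov(my_matrix)
print(result)
```

The double loop is fine for a handful of columns; for wide matrices the masked-array version is the more practical choice.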

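Note that np.cov and ma.cov treat each row as a variable by default (rowvar=True), so they produce pairwise row covariances, whereas pandas treats each column as a variable. To replicate the pandas behavior (pairwise column covariances), you must pass rowvar=False to ma.cov.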
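
The effect of rowvar is easiest to see on a small array without missing values. The data below is made up purely for illustration; it has 4 observations (rows) of 2 variables (columns), which is the layout pandas expects.

```python
import numpy as np

# Made-up data: rows are observations, columns are variables.
data = np.array([[1.0, 2.0],
                 [2.0, 4.5],
                 [3.0, 5.5],
                 [4.0, 8.0]])

by_rows = np.cov(data)                # default rowvar=True: each row is a variable
by_cols = np.cov(data, rowvar=False)  # each column is a variable
via_transpose = np.cov(data.T)        # same thing, expressed by transposing

print(by_rows.shape)  # one entry per pair of rows
print(by_cols.shape)  # one entry per pair of columns
print(np.allclose(by_cols, via_transpose))
```

Passing rowvar=False is equivalent to transposing the input first, so either form can be used with ma.cov as well.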

answered Oct 21 '22 17:10

Igor Raush
