
How to calculate np.cov on a matrix with np.nan values without converting to pd.DataFrame?

I have the following np.array:

my_matrix = np.array([[1,np.nan,3], [np.nan,1,2], [np.nan,1,2]])
array([[ 1., nan,  3.],
       [nan,  1.,  2.],
       [nan,  1.,  2.]])

If I evaluate np.cov on it, I get:

np.cov(my_matrix)
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

But if I were to calculate it with pd.DataFrame.cov I get a different result:

pd.DataFrame(my_matrix).cov()
    0   1   2
0   NaN NaN NaN
1   NaN 0.0 0.000000
2   NaN 0.0 0.333333

I know that, per the pandas documentation, DataFrame.cov excludes NaN values when computing each pairwise covariance.

My question is: how can I get the same (or a similar) result with NumPy? More generally, how should I handle missing data when calculating covariance with NumPy?

Newskooler asked Dec 12 '18 19:12




1 Answer

You can use NumPy's masked arrays: mask the NaN entries with ma.masked_invalid, then compute the covariance with ma.cov.

import numpy.ma as ma
cv = ma.cov(ma.masked_invalid(my_matrix), rowvar=False)
cv
masked_array(
  data=[[--, --, --],
        [--, 0.0, 0.0],
        [--, 0.0, 0.33333333333333337]],
  mask=[[ True,  True,  True],
        [ True, False, False],
        [ True, False, False]],
  fill_value=1e+20)

To produce an ndarray with nan values filled in, use the filled method.

cv.filled(np.nan)
array([[       nan,        nan,        nan],
       [       nan, 0.        , 0.        ],
       [       nan, 0.        , 0.33333333]])
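
If masked arrays are not an option, the pairwise-deletion idea can also be written out by hand. The following is only a minimal sketch, not part of NumPy: pairwise_cov is a hypothetical helper name, and it assumes that each pair of columns should be compared using only the rows where both values are present, with ddof=1, and that a pair with too few shared rows yields nan.

```python
import numpy as np

def pairwise_cov(a, ddof=1):
    """Column covariance using, for each pair of columns, only the rows
    where both values are non-NaN (hypothetical helper, not a NumPy API)."""
    a = np.asarray(a, dtype=float)
    n_cols = a.shape[1]
    valid = ~np.isnan(a)                      # True where a value is present
    out = np.full((n_cols, n_cols), np.nan)   # start with every entry NaN
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = valid[:, i] & valid[:, j]  # rows usable for this pair
            n = both.sum()
            if n - ddof <= 0:
                continue                      # too few shared rows: keep NaN
            xi = a[both, i]
            xj = a[both, j]
            c = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (n - ddof)
            out[i, j] = out[j, i] = c         # covariance is symmetric
    return out

my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])
result = pairwise_cov(my_matrix)
```

On my_matrix this gives the same values as the filled masked-array result above. The double loop is simple rather than fast, so it is best suited to matrices with a modest number of columns.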

Note that np.cov and ma.cov treat each row as a variable by default (rowvar=True), so they return pairwise row covariances. To replicate the pandas behavior (pairwise column covariances), you must pass rowvar=False to ma.cov.
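
To see the effect of rowvar in isolation, here is a small sketch on a NaN-free array. The array x is made up for illustration and is not from the question.

```python
import numpy as np

# Illustrative 2x3 array without missing values (not from the question).
x = np.array([[1.0, 2.0, 4.0],
              [2.0, 4.0, 1.0]])

by_rows = np.cov(x)                # default rowvar=True: each row is a variable
by_cols = np.cov(x, rowvar=False)  # each column is a variable, as in pandas

# The output is square, with one row and column per variable.
shape_rows = by_rows.shape
shape_cols = by_cols.shape

# rowvar=False is equivalent to transposing the input first.
same_as_transpose = np.allclose(by_cols, np.cov(x.T))
```

Here by_rows has shape (2, 2) and by_cols has shape (3, 3), which is a quick way to check which orientation a call is using.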

Igor Raush answered Oct 21 '22 17:10
