python - how to compute correlation-matrix with nans in data-matrix

I couldn't find a function that computes a matrix of correlation coefficients for arrays containing observations of more than two variables when there are NaNs in the data. There are functions that do this for pairs of variables (or one can just mask the arrays using ~np.isnan()). But looping over a large number of variables and computing the correlation for each pair with these functions can be very time-consuming.

So I tried it on my own and soon realized that the difficulty lies in the proper normalization of the covariance. I would be very interested in your opinions on how to do it.

Here is the code:

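import numpy as np

def nancorr(X, nanfact=False):
    # Each row of X is a variable, each column an observation.
    # Demean each variable by its NaN-aware mean (broadcasts over columns;
    # the caller's array is not modified because X is rebound here).
    X = X - np.nanmean(X, axis=1, keepdims=True)

    if nanfact:
        # fact[i, j] = (# observations where i and j are not both NaN) - 1
        mask = np.isnan(X).astype(int)
        fact = X.shape[1] - np.dot(mask, mask.T) - 1

    # NaN entries contribute zero to the sums of products
    X[np.isnan(X)] = 0
    if nanfact:
        cov = np.dot(X, X.T) / fact
    else:
        cov = np.dot(X, X.T)

    # Normalize by the square roots of the diagonal (the variances)
    d = np.diag(cov)
    return cov / np.sqrt(np.multiply.outer(d, d))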

The function assumes that each row is a variable. It is basically adapted from numpy's corrcoef(). I believe there are three ways of doing this:

(1) For each pair of variables, you use only those observations for which neither variable is NaN. This is arguably the most accurate option, but also the most difficult to program if you want to do the computation for more than one pair simultaneously, and it is not covered by the code above. Why, however, throw away information on the mean and variance of each variable just because the corresponding entry of another variable is NaN? Hence the two other options.

(2) We demean each variable by its nanmean, and the variance of each variable is its nanvariance. For the covariance, each observation where one or the other variable is NaN, but not both, is treated as an observation of no covariation and is therefore set to zero. The factor of the covariance is then 1/n, where n = (# of observations where not both variables are NaN) - 1. Both variances in the denominator of the correlation coefficient are scaled by their corresponding number of non-NaN observations minus 1, denoted by n1 and n2 respectively. This is achieved by setting nanfact=True in the function above.

(3) One may wish the covariance and the variances to have the same factor, as is the case for the correlation coefficient without NaNs. The only meaningful way to do this here (if option (1) is not feasible) is to simply ignore the factor (1/n)/sqrt(1/(n1*n2)) = sqrt(n1*n2)/n. Since this number is at most one, the estimated correlation coefficients will be at least as large (in absolute value) as in (2), but will remain between -1 and 1. This is achieved by setting nanfact=False.
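
To make the factors in (2) and (3) concrete, here is a minimal sketch that computes n, n1 and n2 from the NaN mask for two made-up variables and checks that sqrt(n1*n2)/n does not exceed one. The example array and the names n_pair and ratio are assumptions for illustration only.

```python
import numpy as np

# Two illustrative variables (rows) with 7 observations each (made-up data)
X = np.array([[2.0, np.nan, 1.0, -4.0, np.nan, np.nan, 3.0],
              [2.0, 1.0, np.nan, 2.0, 2.1, 1.0, 0.0]])

isnan = np.isnan(X).astype(int)
n_obs = X.shape[1]

# Same expression as `fact` in nancorr: the off-diagonal entry is n,
# the diagonal entries are n1 and n2 (non-NaN count minus 1 per variable)
n_pair = n_obs - isnan @ isnan.T - 1
n1, n2 = np.diag(n_pair)
n = n_pair[0, 1]

# Factor that separates option (2) from option (3)
ratio = np.sqrt(n1 * n2) / n
print(n, n1, n2, ratio)
```

Because n counts every observation where at least one of the two variables is present, it can never be smaller than n1 or n2, which is why the ratio stays at or below one.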

I'd be very interested in your opinions on approaches (2) and (3) and especially, I would very much like to see a solution to (1) without the use of loops.

asked Nov 24 '14 11:11

user3820991

1 Answer

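I think the method you are looking for is corr() from pandas. It computes pairwise correlations between columns and excludes NaN values pair by pair, which corresponds to option (1) in the question. Take the following dataframe as an example. You can also refer to this question: How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?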

import pandas as pd
df = pd.DataFrame({'A': [2, None, 1, -4, None, None, 3],
                   'B': [None, 1, None, None, 1, 3, None],
                   'C': [2, 1, None, 2, 2.1, 1, 0],
                   'D': [-2, 1.1, 3.2, 2, None, 1, None]})

df

    A       B       C       D
0   2       NaN     2       -2
1   NaN     1       1       1.1
2   1       NaN     NaN     3.2
3   -4      NaN     2       2
4   NaN     1       2.1     NaN
5   NaN     3       1       1
6   3       NaN     0       NaN

rho = df.corr()
rho

       A            B          C            D
A   1.000000     NaN       -0.609994    -0.441784
B   NaN          1.0       -0.500000    -1.000000
C   -0.609994    -0.5       1.000000    -0.347928
D   -0.441784    -1.0       -0.347928    1.000000
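
For reference, option (1) from the question can also be written in plain NumPy without an explicit loop over pairs. The following is a minimal sketch and not part of the original answer: the function name nancorr_pairwise is hypothetical, and it assumes, as in the question, that each row is a variable. It works with raw sums over the jointly valid observations, which is likely less numerically stable than demeaning first.

```python
import numpy as np

def nancorr_pairwise(X):
    """Pairwise-complete Pearson correlation; rows are variables (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    valid = ~np.isnan(X)
    V = valid.astype(float)
    X0 = np.where(valid, X, 0.0)          # NaNs replaced by 0 so they drop out of sums

    n = V @ V.T                           # n[i, j]: observations valid for both i and j
    sx = X0 @ V.T                         # sx[i, j]: sum of x_i over jointly valid obs
    sxx = (X0 ** 2) @ V.T                 # sxx[i, j]: sum of x_i^2 over jointly valid obs
    sxy = X0 @ X0.T                       # sum of x_i * x_j over jointly valid obs

    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sxy - sx * sx.T / n         # centered cross-product for each pair
        var = sxx - sx ** 2 / n           # var[i, j]: centered sum of squares of x_i
        return cov / np.sqrt(var * var.T) # var.T[i, j] is the matching term for x_j

# Same data as the dataframe above, transposed so that each row is a variable
nan = np.nan
X = np.array([[2, nan, 1, -4, nan, nan, 3],       # A
              [nan, 1, nan, nan, 1, 3, nan],      # B
              [2, 1, nan, 2, 2.1, 1, 0],          # C
              [-2, 1.1, 3.2, 2, nan, 1, nan]])    # D
rho = nancorr_pairwise(X)
print(np.round(rho, 6))
```

Pairs with no shared observations, such as A and B here, come out as NaN. If pandas is available, comparing the result against pd.DataFrame(X.T).corr() is a reasonable sanity check.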

answered Oct 25 '22 10:10

ju.

Tags: python, numpy, scipy, correlation
