python - how to compute correlation-matrix with nans in data-matrix

I couldn't find a function that computes a matrix of correlation coefficients for arrays containing observations of more than two variables when there are NaNs in the data. There are functions that do this for pairs of variables (or you can simply mask the arrays using ~np.isnan()). But looping over a large number of variables with these functions and computing the correlation for each pair can be very time-consuming.
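For reference, the loop-based approach described above might look like the following sketch. The function name nancorr_loop is made up for illustration, and pairs with fewer than two shared observations are left as NaN by assumption.

```python
import numpy as np

def nancorr_loop(X):
    """Pairwise-complete correlation of the rows of X, one pair at a time."""
    k = X.shape[0]
    R = np.full((k, k), np.nan)
    for i in range(k):
        for j in range(i, k):
            # Keep only observations where both variables are present.
            ok = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if ok.sum() > 1:
                R[i, j] = R[j, i] = np.corrcoef(X[i, ok], X[j, ok])[0, 1]
    return R
```

The double loop calls np.corrcoef once per pair, so the cost grows with the square of the number of variables.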

So I tried it on my own and soon realized that the difficulty lies in the proper normalization of the covariance. I would be very interested in your opinions on how to do it.

Here is the code:

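import numpy as np

def nancorr(X, nanfact=False):
    # Each row of X is a variable; demean every row by its NaN-ignoring mean.
    X = X - np.nanmean(X, axis=1, keepdims=True)

    if nanfact:
        # fact[i, j] = (# observations where i and j are not both NaN) - 1
        mask = np.isnan(X).astype(int)
        fact = X.shape[1] - np.dot(mask, mask.T) - 1

    # Treat missing deviations as zero so they drop out of the sums.
    X[np.isnan(X)] = 0
    if nanfact:
        cov = np.dot(X, X.T) / fact
    else:
        cov = np.dot(X, X.T)

    d = np.diag(cov)
    return cov / np.sqrt(np.multiply.outer(d, d))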

The function assumes that each row is a variable. It is basically adapted from the code of numpy's corrcoef(). I believe there are three ways of doing this:

(1) For each pair of variables, you take only those observations for which neither variable is NaN. This is arguably the most accurate option, but also the most difficult one to program if you want to do the computation for more than one pair simultaneously, and it is not covered by the code above. Why, however, throw away information on the mean and variance of each variable just because the corresponding entry of another variable is NaN? Hence the two other options.

(2) We demean each variable by its nanmean, and the variance of each variable is its nanvariance. For the covariance, each observation where one or the other variable is NaN, but not both, is treated as an observation of no covariation and is therefore set to zero. The covariance is then normalized by 1/n, where n is the number of observations where not both variables are NaN, minus 1. The two variances in the denominator of the correlation coefficient are normalized by their corresponding numbers of non-NaN observations minus 1, denoted by n1 and n2 respectively. This is achieved by setting nanfact=True in the function above.

(3) One may wish the covariance and the variances to share the same normalization factor, as is the case for the correlation coefficient without NaNs. The only meaningful way to do this here (if option (1) is not feasible) is to simply ignore the factor (1/n)/sqrt(1/(n1*n2)) = sqrt(n1*n2)/n. Since this number is at most one, the estimated correlation coefficients will be at least as large (in absolute value) as in (2), but will remain between -1 and 1. This is achieved by setting nanfact=False.
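To make the difference between (2) and (3) concrete, here is a small sketch that computes the factor sqrt(n1*n2)/n for every pair of variables directly from the NaN pattern. The helper name nan_scaling_factor is hypothetical, and the sketch assumes every variable has at least two non-NaN observations. Under that assumption, the result of option (2) should equal the result of option (3) multiplied elementwise by this matrix.

```python
import numpy as np

def nan_scaling_factor(X):
    """Return sqrt(n1*n2)/n for every pair of rows (variables) of X."""
    isnan = np.isnan(X).astype(int)
    # n[i, j]: observations where i and j are not both NaN, minus 1
    n = X.shape[1] - isnan @ isnan.T - 1
    # The diagonal holds each variable's own non-NaN count minus 1 (n1, n2).
    n_own = np.diag(n)
    return np.sqrt(np.multiply.outer(n_own, n_own)) / n
```

Every entry is at most one because n for a pair can never be smaller than n1 or n2, and the diagonal is exactly one.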

I'd be very interested in your opinions on approaches (2) and (3) and especially, I would very much like to see a solution to (1) without the use of loops.
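One possible loop-free route to (1) is to build all the pairwise sums with matrix products against the validity mask. The following is only a sketch under stated assumptions: the name nancorr_pairwise is made up, rows are variables as in nancorr above, and pairs with fewer than two shared observations come out as NaN. Pairs whose shared values are constant or nearly constant are not treated specially and may give NaN or unreliable values.

```python
import numpy as np

def nancorr_pairwise(X):
    """Pairwise-complete Pearson correlation of the rows of X, without loops."""
    X = np.asarray(X, dtype=float)
    valid = ~np.isnan(X)
    M = valid.astype(float)
    # Centre rows on their nanmean (correlation is shift-invariant; this
    # reduces cancellation), then zero-fill the missing entries.
    X0 = np.where(valid, X - np.nanmean(X, axis=1, keepdims=True), 0.0)

    n = M @ M.T              # n[i, j]: observations present in both i and j
    s_i = X0 @ M.T           # sum of x_i over the shared observations
    s_j = s_i.T              # sum of x_j over the shared observations
    ss_i = (X0 ** 2) @ M.T   # sum of x_i^2 over the shared observations
    ss_j = ss_i.T            # sum of x_j^2 over the shared observations
    s_ij = X0 @ X0.T         # sum of x_i * x_j over the shared observations

    with np.errstate(divide="ignore", invalid="ignore"):
        cov = s_ij - s_i * s_j / n
        var_i = ss_i - s_i ** 2 / n
        var_j = ss_j - s_j ** 2 / n
        r = cov / np.sqrt(var_i * var_j)
    # Rounding can push values marginally outside [-1, 1]; NaNs pass through.
    return np.clip(r, -1.0, 1.0)
```

The normalization factors cancel in the ratio, so only the raw sums over the shared observations are needed. The cost is a handful of matrix products instead of one corrcoef call per pair.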

user3820991 asked Nov 24 '14 11:11


1 Answer

I think the method you are looking for is corr() from pandas, which computes the pairwise correlation of the columns and excludes NaN values. You can also refer to this question: "How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?" For example, take the following dataframe:

import pandas as pd
df = pd.DataFrame({'A': [2, None, 1, -4, None, None, 3],
                   'B': [None, 1, None, None, 1, 3, None],
                   'C': [2, 1, None, 2, 2.1, 1, 0],
                   'D': [-2, 1.1, 3.2, 2, None, 1, None]})

df
    A       B       C       D
0   2       NaN     2       -2
1   NaN     1       1       1.1
2   1       NaN     NaN     3.2
3   -4      NaN     2       2
4   NaN     1       2.1     NaN
5   NaN     3       1       1
6   3       NaN     0       NaN
rho = df.corr()
rho
           A         B         C         D
A   1.000000       NaN -0.609994 -0.441784
B        NaN  1.000000 -0.500000 -1.000000
C  -0.609994 -0.500000  1.000000 -0.347928
D  -0.441784 -1.000000 -0.347928  1.000000
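Since the function in the question stores one variable per row while corr() works on columns, a NumPy array would need to be transposed first. A minimal sketch, assuming the same data as above held in a row-per-variable array X (the value min_periods=3 is chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Same data as the dataframe above, one variable (A, B, C, D) per row.
X = np.array([[2.0, np.nan, 1.0, -4.0, np.nan, np.nan, 3.0],
              [np.nan, 1.0, np.nan, np.nan, 1.0, 3.0, np.nan],
              [2.0, 1.0, np.nan, 2.0, 2.1, 1.0, 0.0],
              [-2.0, 1.1, 3.2, 2.0, np.nan, 1.0, np.nan]])

# Transpose so that variables become columns, then convert back to NumPy.
# min_periods sets the minimum number of shared observations per pair;
# pairs with fewer shared observations are returned as NaN.
rho = pd.DataFrame(X.T).corr(min_periods=3).to_numpy()
```

With min_periods=3 the B/D entry becomes NaN, because those two variables share only two observations. The -1.0 in the table above therefore rests on just two points.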
ju. answered Oct 25 '22 10:10