
numpy corrcoef - compute correlation matrix while ignoring missing data


I am trying to compute a correlation matrix for several variables, some of which contain NaN values. I'm using numpy.corrcoef. For element (i, j) of the output correlation matrix, I'd like the correlation to be calculated from all rows where both variable i and variable j have a value.

This is what I have now:

    In[20]: df_counties = pd.read_sql("SELECT Median_Age, Rpercent_2008, overall_LS, population_density FROM countyVotingSM2", db_eng)
    In[21]: np.corrcoef(df_counties, rowvar = False)
    Out[21]:
    array([[ 1.        ,         nan,         nan, -0.10998411],
           [        nan,         nan,         nan,         nan],
           [        nan,         nan,         nan,         nan],
           [-0.10998411,         nan,         nan,  1.        ]])

Too many nan's :(

asked Jul 24 '15 20:07 by Selah



People also ask

Does pandas Corr ignore NaN?

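DataFrame.corr() computes the pairwise correlation of the columns of a pandas DataFrame. NaN values are excluded automatically, pair by pair: each entry uses only the rows where both columns have a value. Non-numeric columns were silently ignored in older pandas versions; since pandas 2.0 the numeric_only parameter defaults to False, so pass numeric_only=True to skip them.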
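To make the pairwise exclusion concrete, here is a minimal sketch on made-up data (the frame `df` and its columns `x`, `y`, `z` are hypothetical, not from the question). It compares one entry of `df.corr()` with a correlation computed by hand on the rows where both columns are present, and shows the `min_periods` parameter, which sets how many shared observations a pair needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column is missing a value in a different row.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.5, 9.0],
    "z": [5.0, 3.0, 4.0, np.nan, 1.0],
})

# corr() drops NaN pair by pair, not row by row across the whole frame.
pairwise = df.corr()

# The (x, y) entry should match a correlation computed only on the
# rows where both x and y have a value.
both = df[["x", "y"]].dropna()
manual_xy = np.corrcoef(both["x"], both["y"])[0, 1]

# min_periods sets the minimum number of shared observations per pair;
# pairs with fewer come back as NaN.
strict = df.corr(min_periods=4)
```

If a pair of columns shares very few rows, the resulting coefficient is computed from very little data, which is the case `min_periods` is meant to guard against.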

How do you make a correlation matrix in Python?

Method 1: Creating a correlation matrix with NumPy. NumPy provides the corrcoef() function. For two variables x and y it returns a 2×2 matrix holding the correlations of x with x (0,0), x with y (0,1), y with x (1,0) and y with y (1,1). For n variables the result is an n×n matrix.

How does Numpy Corrcoef work?

In NumPy, the numpy.corrcoef() function computes Pearson product-moment correlation coefficients. Pass it the arrays to compare and it returns the matrix of Pearson correlation coefficients between them.
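As an illustration, the sketch below calls `np.corrcoef` on two small made-up arrays, then on three variables stored as columns, and finally shows how one NaN spreads through the result. All array names and values are hypothetical.

```python
import numpy as np

# Hypothetical sample data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Two 1-D inputs give a 2x2 matrix; the off-diagonal entry is r(x, y).
r = np.corrcoef(x, y)
r_xy = r[0, 1]

# With variables stored as columns, pass rowvar=False to get an n x n matrix.
data = np.column_stack([x, y, x[::-1]])
r3 = np.corrcoef(data, rowvar=False)

# A single NaN turns every entry involving that variable into NaN,
# because the mean of that variable is already NaN.
y_nan = y.copy()
y_nan[2] = np.nan
r_nan = np.corrcoef(x, y_nan)
```

This is the behaviour seen in the question: variables without missing values still correlate with each other, while any row or column touching a variable with a NaN is lost.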


1 Answer

One of the main features of pandas is that it is NaN-friendly. To calculate the correlation matrix, simply call df_counties.corr(), which excludes NaN values pair by pair. The example below demonstrates that df.corr() is NaN-tolerant whereas np.corrcoef is not.

    import pandas as pd
    import numpy as np

    # data
    # ==============================
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    df

             A       B       C       D       E
    0   1.7641  0.4002  0.9787  2.2409  1.8676
    1      NaN  0.9501     NaN     NaN  0.4106
    2   0.1440  1.4543  0.7610  0.1217  0.4439
    3   0.3337  1.4941     NaN  0.3131     NaN
    4      NaN  0.6536  0.8644     NaN  2.2698
    5      NaN  0.0458     NaN  1.5328  1.4694
    6   0.1549  0.3782     NaN     NaN     NaN
    7   0.1563  1.2303  1.2024     NaN     NaN
    8      NaN     NaN     NaN  1.9508     NaN
    9      NaN     NaN  0.7775     NaN     NaN
    ..     ...     ...     ...     ...     ...
    90     NaN  0.8202  0.4631  0.2791  0.3389
    91  2.0210     NaN     NaN  0.1993     NaN
    92     NaN     NaN     NaN  0.1813     NaN
    93  2.4125     NaN     NaN     NaN  0.2515
    94     NaN     NaN     NaN     NaN  1.7389
    95  0.9944  1.3191     NaN  1.1286  0.4960
    96  0.7714  1.0294     NaN     NaN  0.8626
    97     NaN  1.5133  0.5531     NaN  0.2205
    98     NaN     NaN  1.1003  1.2980  2.6962
    99     NaN     NaN     NaN     NaN     NaN

    [100 rows x 5 columns]

    # calculations
    # ================================
    df.corr()

            A       B       C       D       E
    A  1.0000  0.2718  0.2678  0.2822  0.1016
    B  0.2718  1.0000 -0.0692  0.1736 -0.1432
    C  0.2678 -0.0692  1.0000 -0.3392  0.0012
    D  0.2822  0.1736 -0.3392  1.0000  0.1562
    E  0.1016 -0.1432  0.0012  0.1562  1.0000

    np.corrcoef(df, rowvar=False)

    array([[ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan]])
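If you need to stay in NumPy, the same pairwise-complete behaviour can be approximated by hand. The following is only a sketch, not part of either library: `pairwise_corrcoef` and `sample` are hypothetical names, and the loop trades speed for clarity. For each pair of columns it keeps only the rows where both values are present and calls `np.corrcoef` on that subset.

```python
import numpy as np

def pairwise_corrcoef(data):
    """Correlation matrix of the columns of `data`, where each entry uses
    only the rows in which both columns are non-NaN (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n_vars = data.shape[1]
    valid = ~np.isnan(data)
    out = np.full((n_vars, n_vars), np.nan)
    for i in range(n_vars):
        for j in range(i, n_vars):
            both = valid[:, i] & valid[:, j]
            if both.sum() < 2:
                continue  # too few shared observations; leave the entry as NaN
            r = np.corrcoef(data[both, i], data[both, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out

# Hypothetical data: random values with some entries knocked out.
rng = np.random.default_rng(0)
sample = rng.standard_normal((100, 4))
sample[sample < -1.0] = np.nan

corr = pairwise_corrcoef(sample)
```

On the same input this should agree with `pd.DataFrame(sample).corr()` up to floating-point error, so in practice the pandas call is the simpler choice.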
answered Sep 21 '22 05:09 by Jianxun Li
