I am trying to compute a correlation matrix of several values. These values include some 'nan' values. I'm using numpy.corrcoef. For element(i,j) of the output correlation matrix I'd like to have the correlation calculated using all values that exist for both variable i and variable j.
This is what I have now:
In[20]: df_counties = pd.read_sql("SELECT Median_Age, Rpercent_2008, overall_LS, population_density FROM countyVotingSM2", db_eng)

In[21]: np.corrcoef(df_counties, rowvar = False)

Out[21]:
array([[ 1.        ,         nan,         nan, -0.10998411],
       [        nan,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [-0.10998411,         nan,         nan,  1.        ]])
Too many nan's :(
DataFrame.corr() computes the pairwise correlation of all columns in a pandas DataFrame. NaN values are excluded automatically, pair by pair. Non-numeric columns are ignored (in pandas 2.0 and later this requires passing numeric_only=True).
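To make the pairwise exclusion concrete, here is a minimal sketch on made-up toy data (the column names and values are hypothetical, not from the question). It checks one entry of df.corr() against a manual calculation on the rows where both columns are present, and shows the min_periods argument, which sets how many shared observations a pair needs before a value is reported.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is missing in row 1, "c" is missing in row 3.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, 4.5, 8.0, 9.0],
    "c": [5.0, 3.0, 4.0, np.nan, 1.0],
})

# Pearson correlation by default; NaNs are dropped separately for each pair.
corr = df.corr()

# Manual check for the (b, c) entry: keep only rows where both are present.
both = df[["b", "c"]].dropna()
manual_bc = np.corrcoef(both["b"], both["c"])[0, 1]

# Require at least 4 shared observations per pair.
# "b" and "c" share only 3 rows, so that entry becomes NaN.
strict = df.corr(min_periods=4)

print(corr)
print(manual_bc)
print(strict)
```

Each entry of the result may therefore be based on a different number of rows, which is worth keeping in mind when comparing coefficients.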
Method 1: Creating a correlation matrix using the NumPy library
NumPy's corrcoef() function returns a correlation matrix. For two variables x and y it is 2×2, containing the correlations of x with x (0,0), x with y (0,1), y with x (1,0) and y with y (1,1).
In NumPy, we can compute the Pearson product-moment correlation coefficients of two given arrays with the numpy.corrcoef() function. We pass the arrays as arguments and it returns the matrix of their Pearson product-moment correlation coefficients.
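A short sketch of that call with two made-up arrays (the values are illustrative only). It also shows what happens once one of the inputs contains a NaN: every entry involving that variable becomes NaN, which is the pattern seen in the question's output.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Two input variables give a 2x2 matrix.
r = np.corrcoef(x, y)
r_xy = r[0, 1]  # the off-diagonal entry is corr(x, y)

# A single NaN in y contaminates every entry that involves y.
y_nan = y.copy()
y_nan[2] = np.nan
r_nan = np.corrcoef(x, y_nan)

print(r)
print(r_nan)
```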
One of the main features of pandas is being NaN friendly. To calculate the correlation matrix, simply call df_counties.corr(). Below is an example to demonstrate that df.corr() is NaN tolerant whereas np.corrcoef is not.
import pandas as pd
import numpy as np

# data
# ==============================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
df[df < 0] = np.nan
df

         A       B       C       D       E
0   1.7641  0.4002  0.9787  2.2409  1.8676
1      NaN  0.9501     NaN     NaN  0.4106
2   0.1440  1.4543  0.7610  0.1217  0.4439
3   0.3337  1.4941     NaN  0.3131     NaN
4      NaN  0.6536  0.8644     NaN  2.2698
5      NaN  0.0458     NaN  1.5328  1.4694
6   0.1549  0.3782     NaN     NaN     NaN
7   0.1563  1.2303  1.2024     NaN     NaN
8      NaN     NaN     NaN  1.9508     NaN
9      NaN     NaN  0.7775     NaN     NaN
..     ...     ...     ...     ...     ...
90     NaN  0.8202  0.4631  0.2791  0.3389
91  2.0210     NaN     NaN  0.1993     NaN
92     NaN     NaN     NaN  0.1813     NaN
93  2.4125     NaN     NaN     NaN  0.2515
94     NaN     NaN     NaN     NaN  1.7389
95  0.9944  1.3191     NaN  1.1286  0.4960
96  0.7714  1.0294     NaN     NaN  0.8626
97     NaN  1.5133  0.5531     NaN  0.2205
98     NaN     NaN  1.1003  1.2980  2.6962
99     NaN     NaN     NaN     NaN     NaN

[100 rows x 5 columns]

# calculations
# ================================
df.corr()

        A       B       C       D       E
A  1.0000  0.2718  0.2678  0.2822  0.1016
B  0.2718  1.0000 -0.0692  0.1736 -0.1432
C  0.2678 -0.0692  1.0000 -0.3392  0.0012
D  0.2822  0.1736 -0.3392  1.0000  0.1562
E  0.1016 -0.1432  0.0012  0.1562  1.0000

np.corrcoef(df, rowvar=False)

array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])
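If you need to stay in NumPy, the same pairwise behaviour can be approximated by hand. The sketch below is only an illustration: pairwise_corrcoef is a hypothetical helper, not a NumPy function. For each pair of columns it keeps only the rows where both values are present and calls np.corrcoef on those. It rebuilds the same random frame as the example above, so the result can be compared with df.corr().

```python
import numpy as np
import pandas as pd


def pairwise_corrcoef(data):
    """Pearson correlation between columns, using for each pair only the
    rows where both columns are non-NaN.

    Hypothetical helper for illustration; not part of NumPy.
    """
    data = np.asarray(data, dtype=float)
    n_vars = data.shape[1]
    out = np.full((n_vars, n_vars), np.nan)
    valid = ~np.isnan(data)
    for i in range(n_vars):
        for j in range(i, n_vars):
            both = valid[:, i] & valid[:, j]
            if both.sum() < 2:
                continue  # too few shared observations; leave as NaN
            r = np.corrcoef(data[both, i], data[both, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out


# Same construction as the example above.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list("ABCDE"))
df[df < 0] = np.nan

result = pairwise_corrcoef(df.to_numpy())
print(np.round(result, 4))
```

The double loop is fine for a handful of columns; for wide data, df.corr() is likely the more practical choice.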