
Some of my columns go missing when I use df.corr in pandas

Here is my code:


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('death_regression2.csv')
data3 = data.replace(r'\s+', np.nan, regex = True)  


plt.figure(figsize=(90,90)) 
corr = data3.corr()

print(np.shape(list(corr)))
print(np.shape(data3))

Output:

(135,)
(4909, 204)

Before calling the correlation function, the data had 204 columns, but after data3.corr() some of them go missing and only 135 remain.

How do I check the correlation between all columns in the data?

Asked by Heean, Mar 04 '19 09:03



1 Answer

Without seeing any additional data to understand why you are missing columns, we will have to inspect what pd.DataFrame.corr does.

As the documentation outlines, it computes the pairwise correlations of columns. Because you specified no arguments, it uses the default method and calculates Pearson's r, which measures the linear correlation between two variables (X, Y). It takes values between -1 and 1: -1 is an exact negative linear correlation, 1 is an exact positive linear correlation, and 0 means no linear correlation (i.e., the plot of X against Y looks random and a linear regression would fit a flat slope).
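To make the definition concrete, here is a minimal sketch that computes Pearson's r directly from its formula (the covariance of the two variables divided by the product of their spreads) and compares it with NumPy's np.corrcoef. The helper name pearson_r and the sample arrays are made up for illustration and are not part of the question's data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: sum of co-deviations divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson_r(x, 2 * x + 1)      # exact positive linear relationship
r_down = pearson_r(x, -3 * x + 7)   # exact negative linear relationship
r_curve = pearson_r(x, x ** 2)      # non-linear relationship
r_check = np.corrcoef(x, x ** 2)[0, 1]  # NumPy's value for the same pair

print(r_up, r_down)
print(r_curve, r_check)
```

The two exact linear relationships give values at the ends of the range, and the hand-written formula agrees with np.corrcoef for the curved pair.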

For non-numerical variables, there is no concept of correlation (at least within the context of Pearson's r and this answer), so pd.DataFrame.corr simply ignores non-numerical columns (i.e., those that do not hold float or integer values) and drops them, which explains why you have fewer columns.

If your dropped values are in fact numerical but stored (for example) as strings, you probably need to convert them before calling .corr().
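One way to track down and convert such columns, sketched under assumptions: the small frame below is invented for illustration (the column names age, rate and city are hypothetical), and it assumes the numbers were read in as strings. select_dtypes lists the non-numeric columns, and pd.to_numeric with errors='coerce' converts what it can and turns everything else into NaN.

```python
import pandas as pd

# Hypothetical data: 'rate' holds numbers stored as strings, 'city' is real text.
df = pd.DataFrame({
    'age': [23, 35, 47, 51],
    'rate': ['0.1', '0.4', '0.35', '0.8'],
    'city': ['Oslo', 'Lima', 'Pune', 'Kobe'],
})

# Columns whose dtype is not numeric are the ones .corr() cannot use.
non_numeric = df.select_dtypes(exclude='number').columns.tolist()

# Try to convert each of them; values that are not numbers become NaN.
converted = df.copy()
for col in non_numeric:
    converted[col] = pd.to_numeric(converted[col], errors='coerce')

# Columns that are entirely NaN after conversion were genuinely non-numeric.
converted = converted.dropna(axis=1, how='all')

corr = converted.corr()
print(non_numeric)
print(list(corr.columns))
```

Running the same steps on the question's data3 should show which of the 204 columns are not numeric and which of them can be recovered by conversion.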

As an example:

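import numpy as np
import pandas as pd

x = np.random.rand(10)
y = np.random.rand(10)
x_scaled = x * 6  # scaled copy of x, so perfectly correlated with x
cat = ['one', 'two', 'three', 'four', 'five',
       'six', 'seven', 'eight', 'nine', 'ten']  # non-numerical column

df = pd.DataFrame({'x': x, 'y': y, 'x_s': x_scaled, 'cat': cat})

# On pandas >= 2.0, use df.corr(numeric_only=True); older versions drop 'cat' silently
df.corr()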

returns:

            x         y       x_s
x    1.000000 -0.470699  1.000000
y   -0.470699  1.000000 -0.470699
x_s  1.000000 -0.470699  1.000000

which is our correlation matrix but our non-numerical column (cat) has been dropped.

If you plot the different numerical variables against each other, you get the plot below:

[Figure: pearsons_r_example]

This helps highlight the different correlations: x and x_s are perfectly correlated, and by chance there is a negative linear correlation between x and y.

Answered by FChm, Sep 20 '22 07:09