Here is my code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('death_regression2.csv')
data3 = data.replace(r'\s+', np.nan, regex = True)
plt.figure(figsize=(90,90))
corr = data3.corr()
print(np.shape(list(corr)))
print(np.shape(data3))
Output:
(135,)
(4909, 204)
So before I use the correlation function, the total number of parameters is 204 (the number of columns), but after calling data3.corr() some parameters go missing and only 135 remain.
How do I check the correlation between all columns in the data?
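DataFrame.corr() computes the pairwise correlation of all columns in a pandas DataFrame. NaN values are excluded automatically, and columns with a non-numeric data type are ignored. (From pandas 2.0 on, you have to pass numeric_only=True to get this behaviour; without it, text columns typically raise an error instead of being skipped.)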
If you want to see all columns when you call df.head(), raise pandas' column display limit before running your code. All column data will then be visible.
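The snippet this answer refers to is not shown here. One common way to do it, offered as an assumption rather than as the original author's snippet, is the display.max_columns option; the wide frame below is hypothetical and only there to have something to print:

```python
import pandas as pd

# Assumption: this stands in for the missing snippet; it is not the original.
# None removes the limit on how many columns pandas prints.
pd.set_option('display.max_columns', None)

# Hypothetical wide frame, used only to show that no column is cut out.
wide = pd.DataFrame([list(range(30))], columns=[f'col{i}' for i in range(30)])
print(wide.head())
```

The output may wrap over several lines, but no column is replaced by "...".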
Pandas leaves an observation out of a pairwise correlation if either of its two values is NaN. We can verify that by removing those values ourselves and checking the results.
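A minimal sketch of that check, using a small made-up frame (the column names and values are hypothetical): the coefficient from corr() on data containing a NaN should match the one obtained after dropping the incomplete row by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' has one missing value.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 1.0, np.nan, 5.0, 3.0],
})

# corr() skips the row where 'b' is NaN when pairing 'a' with 'b'.
r_with_nan = df.corr().loc['a', 'b']

# Dropping that row by hand first should give the same coefficient.
r_after_dropna = df.dropna().corr().loc['a', 'b']

print(r_with_nan, r_after_dropna)
print(np.isclose(r_with_nan, r_after_dropna))
```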
Without seeing any additional data to understand why you are missing columns, we will have to inspect what pd.DataFrame.corr does.
As the documentation outlines, it computes the pairwise correlations of columns. Because you specified no arguments, it uses the default method and calculates Pearson's r, which measures the linear correlation between two variables (X, Y). It can take values between -1 and 1, ranging from an exact negative linear correlation to an exact positive linear correlation, with 0 being no linear correlation (i.e., the plot of X against Y is a random scatter and a linear regression would fit a flat slope).
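To make the definition concrete, here is a small sketch that computes Pearson's r by hand and compares it with what pandas returns. The data are random and the seed is arbitrary; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
x = pd.Series(rng.random(10))
y = pd.Series(rng.random(10))

# Pearson's r: covariance of x and y divided by the product of their
# standard deviations (the normalisation terms cancel out).
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # method='pearson' is the default

print(r_manual, r_pandas)
```

Both numbers should agree up to floating-point rounding.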
For non-numerical variables, there is no concept of correlation (at least within the context of Pearson's r and this answer), and pd.DataFrame.corr simply ignores non-numerical columns (i.e., those that do not hold float or integer values) and drops them, which explains why you have fewer columns.
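A rough way to see which columns are affected is to compare the numeric columns with the full column list. The frame below is a hypothetical stand-in for data3, since the real file is not available:

```python
import pandas as pd

# Hypothetical stand-in for data3: 'c' holds numbers stored as strings.
data3 = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [3, 1, 2],
    'c': ['4', '5', '6'],
})

# Columns with a numeric dtype take part in the correlation ...
numeric_cols = data3.select_dtypes(include='number').columns
# ... everything else is left out.
dropped_cols = data3.columns.difference(numeric_cols)

print(list(dropped_cols))
print(data3.dtypes)
```

On the real data, the list of dropped columns together with their dtypes should show whether they are genuinely non-numeric or just stored as text.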
If your dropped values are in fact numerical but stored (for example) as strings, you probably need to convert them before calling .corr().
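One possible way to do that conversion, sketched on made-up data, is pd.to_numeric with errors='coerce'. Whether coercing unparsable entries to NaN is appropriate depends on your data, so treat this as an assumption rather than a recommendation:

```python
import pandas as pd

# Hypothetical frame: 'c' is numeric but stored as strings, 'd' is text.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'c': ['4', '5', ' ', '7'],
    'd': ['x', 'y', 'z', 'w'],
})

# errors='coerce' turns anything that cannot be parsed into NaN.
converted = df.apply(pd.to_numeric, errors='coerce')

# Columns that became entirely NaN were never numeric; drop them.
converted = converted.dropna(axis=1, how='all')

corr = converted.corr()
print(corr.shape)
```

Here 'c' is kept after conversion and 'd' is dropped, so the correlation matrix covers 'a' and 'c'.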
As an example:
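import numpy as np
import pandas as pd
x = np.random.rand(10)
y = np.random.rand(10)
x_scaled = x*6  # a scaled copy of x, so perfectly correlated with it
cat = ['one', 'two', 'three', 'four', 'five',
       'six', 'seven', 'eight', 'nine', 'ten']
df = pd.DataFrame({'x':x, 'y':y, 'x_s':x_scaled, 'cat':cat})
# On pandas 2.0 or later, call df.corr(numeric_only=True) instead
df.corr()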
returns:
            x         y       x_s
x    1.000000 -0.470699  1.000000
y   -0.470699  1.000000 -0.470699
x_s  1.000000 -0.470699  1.000000
which is our correlation matrix, but our non-numerical column (cat) has been dropped.
If you plot the different numerical variables against each other, you get a plot that helps highlight the different correlations: by chance there is a negative linear correlation between x and y.
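The plot itself is not reproduced here. A sketch of how a comparable figure could be drawn, assuming pandas.plotting.scatter_matrix (the original may well have been made with something else), looks like this:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; remove to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.random.rand(10)
y = np.random.rand(10)
df = pd.DataFrame({'x': x, 'y': y, 'x_s': x * 6})

# One scatter panel per pair of numeric columns, histograms on the diagonal.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6))

# Call plt.show() or plt.savefig(...) here to view or store the figure.
```

The x against x_s panels should fall on a straight line, while x against y shows whatever correlation the random draw happened to produce.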