Some of my columns go missing when I use df.corr in pandas

Tags: python, pandas, correlation

Here is my code:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('death_regression2.csv')
data3 = data.replace(r'\s+', np.nan, regex = True)  


plt.figure(figsize=(90,90)) 
corr = data3.corr()

print(np.shape(list(corr)))
print(np.shape(data3))

Output:

(135,)
(4909, 204)

Before I call the correlation function, the data has 204 parameters (the number of columns), but after data3.corr() some parameters go missing and only 135 remain.

How do I check the correlation between all columns in the data?

asked Mar 04 '19 09:03

Heean

1 Answer

Without seeing any additional data to understand why you are missing columns, we will have to inspect what pd.DataFrame.corr does.

As the documentation outlines, it computes the pairwise correlations of columns. Because you specified no arguments, it uses the default method and calculates Pearson's r, which measures the linear correlation between two variables (X, Y). It takes values between -1 and 1, ranging from an exact negative linear correlation to an exact positive linear correlation, with 0 meaning no linear correlation (i.e., the plot of X against Y is a random scatter and a linear regression would fit a flat slope).
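To make the definition concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with what pandas returns. The helper name pearson_r and the fixed seed are my own choices for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.random(10)
y = rng.random(10)

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

df = pd.DataFrame({"x": x, "y": y})
r_manual = pearson_r(x, y)
r_pandas = df.corr().loc["x", "y"]  # default method is Pearson

# The two values should agree up to floating-point error.
print(r_manual, r_pandas)
```

Both columns are numeric here, so nothing is dropped and the matrix is 2 x 2.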

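For non-numerical variables, there is no concept of correlation (at least within the context of Pearson's r and this answer), so pd.DataFrame.corr simply ignores non-numerical columns (i.e., those whose dtype is neither float nor integer) and drops them, which explains why you have fewer columns. (In pandas 2.0 and later the default is numeric_only=False, so corr raises an error on such columns unless you pass numeric_only=True.)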
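One way to see which columns are affected is to list the non-numeric ones before calling .corr(). The sketch below uses a small made-up frame (the column names are hypothetical). In your case you would run the same select_dtypes call on data3.

```python
import pandas as pd

# Toy data standing in for the real file; column names are made up.
df = pd.DataFrame({
    "x": [0.1, 0.5, 0.9],
    "n": [1, 2, 3],
    "as_text": ["1.5", "2.5", "3.5"],  # numbers stored as strings: not a numeric dtype
    "label": ["a", "b", "c"],          # genuinely categorical
})

# Columns that a Pearson correlation cannot use as they stand.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Selecting the numeric columns explicitly behaves the same across pandas versions.
corr = df.select_dtypes(include="number").corr()
print(corr.shape)
```

With 204 columns in and 135 out, this list should contain the 69 columns you are missing.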

If your dropped values are in fact numerical but stored (for example) as strings, you probably need to convert them before calling .corr().
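A minimal sketch of that conversion, assuming the string columns hold numbers mixed with blanks or placeholders. The frame raw and its contents are invented for illustration. pd.to_numeric with errors="coerce" turns anything it cannot parse into NaN.

```python
import pandas as pd

# Made-up data: "a" and "b" are numeric values stored as strings.
raw = pd.DataFrame({
    "a": ["1.0", "2.0", " ", "4.0"],   # a whitespace-only cell
    "b": ["10", "8", "6", "n/a"],      # a placeholder for a missing value
    "c": [0.5, 1.5, 2.5, 3.5],         # already numeric
})

# Apply the conversion column by column; unparseable cells become NaN.
converted = raw.apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)

# .corr() skips NaN pairwise, so every column now appears in the result.
corr = converted.corr()
print(corr.shape)
```

A column that is entirely NaN after conversion was probably categorical rather than numeric, so it is worth checking converted.isna().all() before trusting the result.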

As an example:

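import numpy as np
import pandas as pd

x = np.random.rand(10)
y = np.random.rand(10)
x_scaled = x*6  # a linear rescaling of x, so perfectly correlated with it
cat = ['one', 'two', 'three', 'four', 'five', 
       'six','seven', 'eight', 'nine', 'ten']  # non-numerical column

df = pd.DataFrame({'x':x, 'y':y, 'x_s':x_scaled, 'cat':cat})

df.corr()  # on pandas >= 2.0, use df.corr(numeric_only=True)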

returns:

        x            y          x_s
 x   1.000000    -0.470699    1.000000
 y  -0.470699     1.000000   -0.470699
x_s  1.000000    -0.470699    1.000000

which is our correlation matrix but our non-numerical column (cat) has been dropped.

If you plot the different numerical variables against each other you get the below plot:

[Figure: pearsons_r_example]

which helps highlight the different correlations: by chance there is a negative linear correlation between x and y.
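If you want to draw a similar figure yourself, something like the following should work. This is a sketch using pandas.plotting.scatter_matrix, which is not necessarily how the original plot was made. The seed and the output file name are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.random(10)
y = rng.random(10)
df = pd.DataFrame({"x": x, "y": y, "x_s": x * 6})

# One scatter panel per pair of columns, histograms on the diagonal.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.savefig("pearsons_r_sketch.png")  # hypothetical file name
```

The x against x_s panel falls on a straight line (r = 1), while the x against y panel shows whatever correlation the random draw happens to produce.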

answered Sep 20 '22 07:09

FChm
