 

Pandas: How to drop self correlation from correlation matrix

I'm trying to find the highest correlations between different columns with pandas. I know I can get the correlation matrix with

df.corr()

I know I can get the highest correlations after that with

corr = df.corr().stack()
corr = corr.sort_values()
corr[-5:]

The problem is that these correlations also include each column's correlation with itself (1). How do I remove these self-correlation entries? I know I could remove all values equal to 1, but I don't want to do that, since there might be genuine correlations of 1 too.

mikkom asked Feb 15 '16 09:02



2 Answers

Say you have

corrs = df.corr()

If I understand correctly, the problem is the diagonal elements. You can set them to a value outside the valid range, say -2 (correlations lie in [-1, 1], so this is necessarily lower than every real correlation), with

np.fill_diagonal(corrs.values, -2)

Example

(Many thanks to @Fabian Rost for the improvement & @jezrael for the DataFrame)

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four': [.34, .75, .91, .19, .21],
    'zive': [0.1, .32, .2, 0.4, 0.8],
    'six': [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3]})
corrs = df.corr()
# Overwrite the self-correlations on the diagonal with the sentinel value -2.
np.fill_diagonal(corrs.values, -2)
>>> corrs
          drive      four       one       six     three       two      zive
drive -2.000000 -0.039607 -0.747365  1.000000  1.000000  0.238102 -0.747365
four  -0.039607 -2.000000 -0.489177 -0.039607 -0.039607  0.159583 -0.489177
one   -0.747365 -0.489177 -2.000000 -0.747365 -0.747365 -0.351531  1.000000
six    1.000000 -0.039607 -0.747365 -2.000000  1.000000  0.238102 -0.747365
three  1.000000 -0.039607 -0.747365  1.000000 -2.000000  0.238102 -0.747365
two    0.238102  0.159583 -0.351531  0.238102  0.238102 -2.000000 -0.351531
zive  -0.747365 -0.489177  1.000000 -0.747365 -0.747365 -0.351531 -2.000000
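
Depending on the pandas version and settings (for example with Copy-on-Write enabled), corrs.values may be a read-only array, in which case filling it in place can fail. A minimal sketch of the same sentinel idea that writes into an explicit copy instead; the names values and masked and the reduced three-column frame are illustrative assumptions, not part of the answer above:

```python
import numpy as np
import pandas as pd

# Reduced version of the example frame, for illustration only.
df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
})

corrs = df.corr()

# Take an explicit, writable copy of the values rather than writing through corrs.values.
values = corrs.to_numpy(copy=True)
np.fill_diagonal(values, -2)  # sentinel below any real correlation

# Rebuild a labelled DataFrame; corrs itself is left unchanged.
masked = pd.DataFrame(values, index=corrs.index, columns=corrs.columns)
```

Stacking masked and sorting the result then leaves the -2 entries at the bottom, so the last few rows contain only real correlations.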
Ami Tavory answered Nov 09 '22 00:11


I recently found an even cleaner answer to my question: you can compare the multi-index levels by value.

This is what I ended up using.

corr = df.corr().stack()
corr = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)]
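
A self-contained sketch of the same filter, followed by picking the five highest remaining values with nlargest. The small frame and the names first, second, off_diagonal and top are assumptions made for illustration; 'six' deliberately duplicates 'three' so that a genuine correlation of 1 survives the filter:

```python
import pandas as pd

df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'six': [.9, .3, .6, .5, .3],  # identical to 'three'
})

# Stack the correlation matrix into a Series indexed by (row label, column label).
corr = df.corr().stack()

# Keep only entries whose row label differs from their column label.
first = corr.index.get_level_values(0)
second = corr.index.get_level_values(1)
off_diagonal = corr[first != second]

# Five highest correlations among the remaining pairs.
top = off_diagonal.nlargest(5)
```

Because the correlation matrix is symmetric, every pair still appears twice here, once as (a, b) and once as (b, a).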
mikkom answered Nov 09 '22 02:11