Pandas: How to drop self correlation from correlation matrix

Tags:

python

pandas

numpy

correlation

I'm trying to find the highest correlations between different columns with pandas. I know I can get the correlation matrix with

df.corr()

I know I can get the highest correlations after that with

corr = df.corr().stack()
corr = corr.sort_values()
corr[-5:]

The problem is that these correlations also include each column's correlation with itself (always 1). How do I remove these self-correlation entries? I know I could drop every value equal to 1, but I don't want to do that because there might be genuine correlations of 1 between different columns too.


asked Feb 15 '16 09:02

mikkom

2 Answers

Say you have

corrs = df.corr()

Then the problem is with the diagonal elements, IIUC. You can easily set them to some negative value, say -2 (which is necessarily lower than every correlation, since correlations lie in [-1, 1]), with

np.fill_diagonal(corrs.values, -2)

Example

(Many thanks to @Fabian Rost for the improvement & @jezrael for the DataFrame)

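import numpy as np
import pandas as pd

# 'zive' duplicates 'one'; 'six' and 'drive' duplicate 'three',
# so some off-diagonal correlations are genuinely 1
df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four': [.34, .75, .91, .19, .21],
    'zive': [0.1, .32, .2, 0.4, 0.8],
    'six': [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3]})
corrs = df.corr()
# Overwrite the diagonal (the self-correlations) in place
np.fill_diagonal(corrs.values, -2)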
>>> corrs
          drive      four       one       six     three       two      zive
drive -2.000000 -0.039607 -0.747365  1.000000  1.000000  0.238102 -0.747365
four  -0.039607 -2.000000 -0.489177 -0.039607 -0.039607  0.159583 -0.489177
one   -0.747365 -0.489177 -2.000000 -0.747365 -0.747365 -0.351531  1.000000
six    1.000000 -0.039607 -0.747365 -2.000000  1.000000  0.238102 -0.747365
three  1.000000 -0.039607 -0.747365  1.000000 -2.000000  0.238102 -0.747365
two    0.238102  0.159583 -0.351531  0.238102  0.238102 -2.000000 -0.351531
zive  -0.747365 -0.489177  1.000000 -0.747365 -0.747365 -0.351531 -2.000000
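
If writing into corrs.values is not an option (depending on the pandas version and its copy-on-write settings, that array may be read-only), a possible variation is to mask the diagonal instead of overwriting it. This is only a sketch under that assumption; the helper name mask_diagonal is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def mask_diagonal(corrs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of a square correlation matrix with NaN on the diagonal."""
    diag = np.eye(len(corrs), dtype=bool)  # True only where row == column
    return corrs.mask(diag)                # mask() replaces True cells with NaN


df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
})

masked = mask_diagonal(df.corr())
# Flatten to (row, column) pairs, drop the NaN diagonal, sort ascending
pairs = masked.stack().dropna().sort_values()
print(pairs)
```

The original corrs is left untouched, and the explicit dropna() keeps the result the same whether or not stack() drops NaN by default.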

answered Nov 09 '22 00:11

Ami Tavory

I recently found an even cleaner answer to my question: you can compare the MultiIndex levels by value.

This is what I ended up using.

corr = df.corr().stack()
corr = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)]
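
One thing to be aware of: after stacking, every pair still appears twice, once as (a, b) and once as (b, a). If that matters, a possible follow-up is to keep only the part of the matrix strictly above the diagonal, which drops self-pairs and mirrored duplicates in one step. This is a rough sketch, not part of the answer above; the variable names upper, unique_pairs and top are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'zive': [0.1, .32, .2, 0.4, 0.8],  # identical to 'one': a genuine correlation of 1
})

corr = df.corr()
# Boolean mask that is True strictly above the diagonal (k=1 excludes the diagonal)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
# where() keeps cells where the mask is True and sets the rest to NaN
unique_pairs = corr.where(upper).stack().dropna()
# Highest correlations first; the ('one', 'zive') pair is kept
top = unique_pairs.sort_values(ascending=False).head(2)
print(top)
```

The genuine correlation of 1 between 'one' and 'zive' survives, while the four self-correlations do not.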

answered Nov 09 '22 02:11

mikkom
