I'm trying to find the highest correlations between different columns with pandas. I know I can get the correlation matrix with
df.corr()
I know I can get the highest correlations after that with
corrs = df.corr().stack()
corrs = corrs.sort_values()
corrs[-5:]
The problem is that these correlations also include each column's correlation with itself (1). How do I remove the entries that are a column's correlation with itself? I know I could remove them by dropping every value equal to 1, but I don't want to do that, as there might be genuine correlations of 1 too.
To remove the correlated features, we can make use of the corr() method of the pandas dataframe. The corr() method returns a correlation matrix containing correlation between all the columns of the dataframe.
In general, it is recommended to avoid highly correlated features in your dataset. A group of highly correlated features adds little or no additional information, but it increases the complexity of the model and therefore the risk of errors.
Say you have
corrs = df.corr()
Then the problem is with the diagonal elements, IIUC. You can easily set them to some negative value, say -2 (which will necessarily be lower than all correlations) with
np.fill_diagonal(corrs.values, -2)
Example
(Many thanks to @Fabian Rost for the improvement & @jezrael for the DataFrame)
import numpy as np
import pandas as pd
df=pd.DataFrame( {
'one':[0.1, .32, .2, 0.4, 0.8],
'two':[.23, .18, .56, .61, .12],
'three':[.9, .3, .6, .5, .3],
'four':[.34, .75, .91, .19, .21],
'zive': [0.1, .32, .2, 0.4, 0.8],
'six':[.9, .3, .6, .5, .3],
'drive':[.9, .3, .6, .5, .3]})
corrs = df.corr()
np.fill_diagonal(corrs.values, -2)
>>> corrs
          drive      four       one       six     three       two      zive
drive -2.000000 -0.039607 -0.747365  1.000000  1.000000  0.238102 -0.747365
four  -0.039607 -2.000000 -0.489177 -0.039607 -0.039607  0.159583 -0.489177
one   -0.747365 -0.489177 -2.000000 -0.747365 -0.747365 -0.351531  1.000000
six    1.000000 -0.039607 -0.747365 -2.000000  1.000000  0.238102 -0.747365
three  1.000000 -0.039607 -0.747365  1.000000 -2.000000  0.238102 -0.747365
two    0.238102  0.159583 -0.351531  0.238102  0.238102 -2.000000 -0.351531
zive  -0.747365 -0.489177  1.000000 -0.747365 -0.747365 -0.351531 -2.000000
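With the diagonal pushed to -2, the highest correlations can be read off by stacking and sorting. What follows is only a sketch under a few assumptions: it fills the diagonal on a writable copy obtained with `to_numpy(copy=True)`, because writing through `corrs.values` may fail in pandas setups that hand back a read-only array (for example with copy-on-write enabled). The names `values`, `filled` and `top` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four': [.34, .75, .91, .19, .21],
    'zive': [0.1, .32, .2, 0.4, 0.8],
    'six': [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3]})

corrs = df.corr()

# Fill the diagonal on an explicit copy (assumption: avoids read-only arrays)
values = corrs.to_numpy(copy=True)
np.fill_diagonal(values, -2)
filled = pd.DataFrame(values, index=corrs.index, columns=corrs.columns)

# Stack into a Series indexed by (row, column) and keep the largest entries;
# the self-correlations now sit at -2 and sort to the bottom
top = filled.stack().sort_values(ascending=False).head(8)
print(top)
```

Each pair shows up twice here, as (a, b) and (b, a), because the correlation matrix is symmetric.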
I recently found an even cleaner answer to my question: you can compare the MultiIndex levels by value.
This is what I ended up using.
corr = df.corr().stack()
corr = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)]
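For completeness, here is how that filter might fit into the full "top correlations" step. This is a sketch, not a definitive version: it reuses the example DataFrame from the other answer, and the `first < second` comparison is an extra assumption of mine (it relies on the column labels being orderable, e.g. strings) that keeps each pair only once. The names `no_self`, `unique_pairs` and `top` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [0.1, .32, .2, 0.4, 0.8],
    'two': [.23, .18, .56, .61, .12],
    'three': [.9, .3, .6, .5, .3],
    'four': [.34, .75, .91, .19, .21],
    'zive': [0.1, .32, .2, 0.4, 0.8],
    'six': [.9, .3, .6, .5, .3],
    'drive': [.9, .3, .6, .5, .3]})

corr = df.corr().stack()
first = corr.index.get_level_values(0)
second = corr.index.get_level_values(1)

# Drop only the self-correlations; genuine correlations of 1 survive
no_self = corr[first != second]

# Optional (assumption: orderable labels): keep each unordered pair once
unique_pairs = corr[first < second]

top = unique_pairs.sort_values(ascending=False).head(4)
print(top)
```

Comparing index levels leaves the values untouched, so nothing needs to be set to a sentinel like -2 and real correlations of 1 between different columns are kept.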