
How to calculate correlation between all columns and remove highly correlated ones using pandas?

I have a huge data set, and before machine learning modeling it is always suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns above a threshold value, say, all columns or descriptors with a correlation > 0.8? The headers should also be retained in the reduced data.

Example data set

GA      PN       PC       MBP     GR      AP
0.033   6.652    6.681    0.194   0.874   3.177
0.034   9.039    6.224    0.194   1.137   3.4
0.035   10.936   10.304   1.015   0.911   4.9
0.022   10.11    9.603    1.374   0.848   4.566
0.035   2.963    17.156   0.599   0.823   9.406
0.033   10.872   10.244   1.015   0.574   4.871
0.035   21.694   22.389   1.015   0.859   9.259
0.035   10.936   10.304   1.015   0.911   4.5

Please help....

jax asked Mar 27 '15 06:03



People also ask

How do you delete highly correlated variables in pandas?

To find correlated features, use the corr() method of the pandas DataFrame. It returns a correlation matrix containing the pairwise correlation between all columns, which you can then use to decide which columns to drop.

How do you find the correlation between all columns in Python?

DataFrame.corr() computes the pairwise correlation of all columns in a pandas DataFrame. NaN values are excluded automatically. Non-numeric columns were silently ignored in older pandas versions; since pandas 2.0 they raise an error unless numeric_only=True is passed.
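As a minimal sketch of that behavior (the column names are made up for illustration, and the `numeric_only` argument assumes pandas 1.5 or later):

```python
import numpy as np
import pandas as pd

# Toy frame: "b" is exactly 2 * "a" but has one missing value,
# and "label" is a non-numeric column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, np.nan, 8.0],
    "label": ["w", "x", "y", "z"],
})

# numeric_only=True leaves "label" out; the row with NaN is skipped
# only for pairs that involve "b".
corr = df.corr(numeric_only=True)
print(corr)
```

The result should be a 2x2 matrix over `a` and `b` only.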


2 Answers

The method here worked well for me, only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

    import numpy as np

    # Create the absolute correlation matrix
    corr_matrix = df.corr().abs()

    # Select the upper triangle of the correlation matrix (k=1 excludes the diagonal)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find features with a correlation greater than 0.95 with any earlier column
    to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

    # Drop the features
    df.drop(to_drop, axis=1, inplace=True)
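Applied to the example data from the question with the 0.8 threshold, the same idea might look like the sketch below. The helper name `drop_correlated` is hypothetical, and unlike the snippet above it returns a reduced copy (headers included) instead of modifying `df` in place.

```python
import io

import numpy as np
import pandas as pd

# Example data from the question, parsed as whitespace-separated text.
RAW = """\
GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
"""

df = pd.read_csv(io.StringIO(RAW), sep=r"\s+")


def drop_correlated(frame, threshold=0.8):
    """Hypothetical helper: return (reduced copy, list of dropped column names)."""
    corr = frame.corr().abs()
    # Keep only entries above the diagonal so each pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it exceeds the threshold with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


reduced, dropped = drop_correlated(df, threshold=0.8)
print("dropped:", dropped)
print(reduced.head())
```

Returning the dropped names alongside the reduced frame makes it easy to apply the same column selection to a test set later.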
Cherry Wu answered Sep 20 '22 19:09



Here is the approach I used:

    def correlation(dataset, threshold):
        col_corr = set()  # names of the deleted columns
        corr_matrix = dataset.corr()
        for i in range(len(corr_matrix.columns)):
            for j in range(i):
                # compare the absolute value so strong negative correlations count too
                if (abs(corr_matrix.iloc[i, j]) >= threshold) and (corr_matrix.columns[j] not in col_corr):
                    colname = corr_matrix.columns[i]  # name of the later column in the pair
                    col_corr.add(colname)
                    if colname in dataset.columns:
                        del dataset[colname]  # delete the column from the dataset in place

        print(dataset)
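One design detail worth noting: the `not in col_corr` check skips a pair when its earlier column has already been removed, so this loop can keep a column that an upper-triangle filter would drop. The sketch below illustrates that on a toy frame. `loop_drop` and `triangle_drop` are hypothetical names, the data is made up, and `loop_drop` is a non-mutating variant of the function above that compares absolute values.

```python
import numpy as np
import pandas as pd


def triangle_drop(frame, threshold):
    """Upper-triangle rule: drop a column if ANY earlier column exceeds the threshold."""
    corr = frame.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]


def loop_drop(frame, threshold):
    """Loop rule: ignore pairs whose earlier column has already been dropped."""
    corr = frame.corr().abs()
    dropped = []
    for i in range(len(corr.columns)):
        for j in range(i):
            if corr.iloc[i, j] >= threshold and corr.columns[j] not in dropped:
                dropped.append(corr.columns[i])
                break  # column i is already marked; no need to check more pairs
    return dropped


# "a" and "c" are uncorrelated with each other, while "b" = a + c
# correlates at about 0.707 with both of them.
a = np.array([1.0, -1.0, 1.0, -1.0])
c = np.array([1.0, 1.0, -1.0, -1.0])
toy = pd.DataFrame({"a": a, "b": a + c, "c": c})

print(triangle_drop(toy, 0.7))  # ['b', 'c']
print(loop_drop(toy, 0.7))      # ['b']
```

Which behavior is preferable depends on the data: the loop keeps more columns, while the upper-triangle filter is stricter.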

Hope this helps!

NISHA DAGA answered Sep 19 '22 19:09
