I have a huge data set and prior to machine learning modeling it is always suggested that first you should remove highly correlated descriptors(columns) how can i calculate the column wice correlation and remove the column with a threshold value say remove all the columns or descriptors having >0.8 correlation. also it should retained the headers in reduce data..
Example data set
GA PN PC MBP GR AP 0.033 6.652 6.681 0.194 0.874 3.177 0.034 9.039 6.224 0.194 1.137 3.4 0.035 10.936 10.304 1.015 0.911 4.9 0.022 10.11 9.603 1.374 0.848 4.566 0.035 2.963 17.156 0.599 0.823 9.406 0.033 10.872 10.244 1.015 0.574 4.871 0.035 21.694 22.389 1.015 0.859 9.259 0.035 10.936 10.304 1.015 0.911 4.5
Please help....
To remove the correlated features, we can make use of the corr() method of the pandas dataframe. The corr() method returns a correlation matrix containing correlation between all the columns of the dataframe.
corr() is used to find the pairwise correlation of all columns in the Pandas Dataframe in Python. Any NaN values are automatically excluded. Any non-numeric data type or columns in the Dataframe, it is ignored.
The method here worked well for me, only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/
import numpy as np # Create correlation matrix corr_matrix = df.corr().abs() # Select upper triangle of correlation matrix upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool)) # Find features with correlation greater than 0.95 to_drop = [column for column in upper.columns if any(upper[column] > 0.95)] # Drop features df.drop(to_drop, axis=1, inplace=True)
Here is the approach which I have used -
def correlation(dataset, threshold): col_corr = set() # Set of all the names of deleted columns corr_matrix = dataset.corr() for i in range(len(corr_matrix.columns)): for j in range(i): if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr): colname = corr_matrix.columns[i] # getting the name of column col_corr.add(colname) if colname in dataset.columns: del dataset[colname] # deleting the column from the dataset print(dataset)
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With