How to calculate correlation between all columns and remove highly correlated ones using pandas?

Tags: python, pandas, correlation

I have a huge data set, and before machine learning modeling it is usually suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns above a threshold, say all columns or descriptors with a correlation > 0.8? The headers should also be retained in the reduced data.

Example data set

    GA      PN       PC       MBP     GR      AP
    0.033   6.652    6.681    0.194   0.874   3.177
    0.034   9.039    6.224    0.194   1.137   3.4
    0.035   10.936   10.304   1.015   0.911   4.9
    0.022   10.11    9.603    1.374   0.848   4.566
    0.035   2.963    17.156   0.599   0.823   9.406
    0.033   10.872   10.244   1.015   0.574   4.871
    0.035   21.694   22.389   1.015   0.859   9.259
    0.035   10.936   10.304   1.015   0.911   4.5

Please help....

asked Mar 27 '15 06:03

jax

2 Answers

The method here worked well for me, only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

    import numpy as np

    # Create correlation matrix (absolute values, so strong negative correlation counts too)
    corr_matrix = df.corr().abs()

    # Select upper triangle of correlation matrix
    # (plain bool is used here; the np.bool alias was removed in newer NumPy releases)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find features with correlation greater than 0.95
    to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

    # Drop features
    df.drop(to_drop, axis=1, inplace=True)
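As a rough sketch, the upper-triangle idea can be wrapped in a function and tried on the sample data from the question with the 0.8 threshold asked for there. The function name `drop_correlated` and its default threshold are my own choices for illustration, not something from the linked article. It returns a new DataFrame, so the original is left untouched and the column headers are kept.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "GA":  [0.033, 0.034, 0.035, 0.022, 0.035, 0.033, 0.035, 0.035],
    "PN":  [6.652, 9.039, 10.936, 10.11, 2.963, 10.872, 21.694, 10.936],
    "PC":  [6.681, 6.224, 10.304, 9.603, 17.156, 10.244, 22.389, 10.304],
    "MBP": [0.194, 0.194, 1.015, 1.374, 0.599, 1.015, 1.015, 1.015],
    "GR":  [0.874, 1.137, 0.911, 0.848, 0.823, 0.574, 0.859, 0.911],
    "AP":  [3.177, 3.4, 4.9, 4.566, 9.406, 4.871, 9.259, 4.5],
})


def drop_correlated(df, threshold=0.8):
    """Return a copy of df without the columns whose absolute correlation
    with any earlier column exceeds threshold (hypothetical helper)."""
    corr = df.corr().abs()
    # True strictly above the diagonal, so each pair is looked at once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


reduced = drop_correlated(df, threshold=0.8)
print(reduced.columns.tolist())  # headers of the columns that were kept
```

Note that the result depends on column order: of two correlated columns, the one that appears later is the one that gets dropped.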

answered Sep 20 '22 19:09

Cherry Wu

Here is the approach I used:

    def correlation(dataset, threshold):
        """Delete, in place, each column that correlates >= threshold with an earlier kept column."""
        col_corr = set()  # names of the deleted columns
        # Signed correlations: use dataset.corr().abs() to also catch strong negative correlation
        corr_matrix = dataset.corr()
        for i in range(len(corr_matrix.columns)):
            for j in range(i):
                if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                    colname = corr_matrix.columns[i]  # name of the column to drop
                    col_corr.add(colname)
                    if colname in dataset.columns:
                        del dataset[colname]  # delete the column from the dataset

        print(dataset)

Hope this helps!
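For anyone who would rather not modify the input, here is a hedged variant of the same loop. It takes absolute correlations, works on a copy, and returns the reduced frame together with the names it dropped. The function name `remove_correlated` and the tiny `demo` frame are made up for illustration.

```python
import pandas as pd


def remove_correlated(dataset, threshold=0.8):
    """Return (reduced_copy, dropped_names) without modifying dataset."""
    corr_matrix = dataset.corr().abs()  # abs() so r = -0.9 also counts as high
    columns = corr_matrix.columns
    dropped = []
    for i in range(len(columns)):
        for j in range(i):
            # Compare column i only against earlier columns that are still kept.
            if columns[j] not in dropped and corr_matrix.iloc[i, j] >= threshold:
                dropped.append(columns[i])
                break
    return dataset.drop(columns=dropped), dropped


# Made-up data: "b" is exactly -2 * "a"; "c" is only weakly correlated with "a".
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-2, -4, -6, -8], "c": [1, -1, 1, -1]})
reduced, dropped = remove_correlated(demo, threshold=0.8)
print(dropped)                   # ['b']
print(reduced.columns.tolist())  # ['a', 'c']
```

Returning the dropped names makes it easy to apply the same column selection to a test set later.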

answered Sep 19 '22 19:09

NISHA DAGA
