How do we determine which variables to remove from our model based on the correlation coefficient?
See the example variables below:
Top 10 Absolute Correlations:
Variable 1       Variable 2      Correlation
pdays            pmonths         1.000000
emp.var.rate     euribor3m       0.970955
euribor3m        nr.employed     0.942545
emp.var.rate     nr.employed     0.899818
previous         pastEmail       0.798017
emp.var.rate     cons.price.idx  0.763827
cons.price.idx   euribor3m       0.670844
contact          cons.price.idx  0.585899
previous         nr.employed     0.504471
cons.price.idx   nr.employed     0.490632
[Correlation matrix heat map of independent variables]
Questions:
1) Given the correlation value calculated between two variables, how do we decide which of the two highly correlated variables to remove?
Ex: the correlation value between pdays and pmonths is 1.000000. Which variable should be removed from the model, pdays or pmonths? How is that variable determined?
2) What correlation threshold is typically used to decide whether to drop a variable? E.g., > 0.65 or > 0.90?
3) Can you please interpret the heat map above and explain which variables should be removed, and why?
Correlation Coefficient
The logic behind using correlation for feature selection is that good predictors are highly correlated with the target but uncorrelated with each other. If two features are highly correlated with each other, one can be predicted from the other, so keeping both adds little information to the model.
A correlation coefficient is a number between -1 and 1 that tells you the strength and direction of a relationship between variables. In other words, it reflects how similar the measurements of two or more variables are across a dataset.
A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. Understanding that relationship is useful because we can use the value of one variable to predict the value of the other variable.
Positive Correlation: means that if feature A increases then feature B also increases, or if feature A decreases then feature B also decreases. The two features move in tandem in a linear relationship.
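A ranking like the "Top 10 Absolute Correlations" table above can be reproduced with pandas. The sketch below uses a small synthetic DataFrame: the column names `pdays`, `pmonths`, and `euribor3m` are borrowed from the question, but the values are random stand-ins, not the real bank-marketing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; replace df with your own DataFrame.
rng = np.random.default_rng(0)
n = 200
pdays = rng.normal(size=n)
df = pd.DataFrame({
    "pdays": pdays,
    "pmonths": pdays / 30.0,          # perfectly correlated copy of pdays (r = 1.0)
    "euribor3m": rng.normal(size=n),  # unrelated noise column
})

# Absolute correlation of every unordered pair of columns, largest first.
corr = df.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs.head(10))
```

Each entry of `pairs` is indexed by the (variable 1, variable 2) tuple, exactly as in the table above.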
You could use another selection criterion for choosing between each pair of highly correlated features. For example, you can use the Information Gain (IG), which measures how much information a feature gives about the class (i.e., its reduction of entropy [TAL14], [SIL07]). Once you have detected a pair of highly correlated features (e.g., as you mentioned, pdays and pmonths), you can measure the IG of each variable and keep the one with the highest IG. Nevertheless, there are other selection criteria that you could apply instead of IG (e.g., Mutual Information Maximization [BHS15]).
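The "keep the member of the pair with the higher IG" idea can be sketched with a hand-rolled information-gain estimate (entropy of the class minus entropy conditioned on a binned feature). The data here is hypothetical: the target `y` is driven by `pdays`, and `pmonths` is a noisier copy, so IG should favor `pdays`.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, y, bins=10):
    """IG = H(y) - H(y | binned feature), with the feature discretized into bins."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    h_cond = 0.0
    for b in np.unique(binned):
        mask = binned == b
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

# Hypothetical data: y depends on pdays; pmonths is a noisy copy of pdays.
rng = np.random.default_rng(0)
pdays = rng.normal(size=1000)
pmonths = pdays + rng.normal(scale=0.7, size=1000)
y = (pdays > 0).astype(int)

ig = {"pdays": information_gain(pdays, y), "pmonths": information_gain(pmonths, y)}
keep = max(ig, key=ig.get)  # keep the member of the correlated pair with the higher IG
```

In practice you would compute this (or use a library estimator such as scikit-learn's `mutual_info_classif`) on your actual features and target, then drop the lower-scoring member of each correlated pair.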
For the threshold, you can choose the value you want (it depends on your problem). However, to play it safe I would select a high value (e.g., 0.95), although you could also consider values around 0.90 to 0.94. Moreover, you can always establish a high value and then lower it gradually, checking your model's performance at each step.
[TAL14] Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review, pages 37–64. CRC Press, 2014.
[SIL07] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
[BHS15] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection using Joint Mutual Information Maximisation. Expert Systems with Applications, 42(22):8520–8532, 2015.