 

Correlation coefficient explanation--Feature Selection

How do I determine which variables to remove from a model based on the correlation coefficient?

See the example variables below:

Top 10 Absolute Correlations:
  Variable 1      Variable 2        Correlation Value
    pdays           pmonths           1.000000
    emp.var.rate    euribor3m         0.970955
    euribor3m       nr.employed       0.942545
    emp.var.rate    nr.employed       0.899818
    previous        pastEmail         0.798017
    emp.var.rate    cons.price.idx    0.763827
    cons.price.idx  euribor3m         0.670844
    contact         cons.price.idx    0.585899
    previous        nr.employed       0.504471
    cons.price.idx  nr.employed       0.490632

The picture below is the correlation matrix heat map of the independent variables:

Questions:

1) Given a high correlation value between two variables, how do I decide which one of the pair to remove?

Ex: the correlation value between pdays and pmonths is 1.000000. Which variable should be removed from the model, pdays or pmonths? How is that determined?

2) What correlation threshold is normally used to drop a variable? Ex: >0.65, >0.90, etc.

3) Can you please interpret the heat map above and explain which variables should be removed, and why?

asked Jun 15 '20 15:06 by Hell Boy




1 Answer

You could use another selection criterion to choose between the two features in each highly correlated pair. For example, you can use Information Gain (IG), which measures how much information a feature gives about the class (i.e., its reduction of entropy [TAL14], [SIL07]). Once you have detected a pair of highly correlated features (e.g., pdays and pmonths, as you mentioned), you can measure the IG of each variable and keep the one with the higher IG. There are other selection criteria you could apply instead of IG (e.g., Mutual Information Maximization [BHS15]).
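As a rough illustration of the IG comparison, here is a minimal sketch using only the Python standard library. It assumes discrete (or already binned) feature values and a categorical target; the helper names (`entropy`, `information_gain`, `pick_by_information_gain`) and the toy data are made up for this example and do not come from your dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, target):
    """IG = H(target) - H(target | feature), for a discrete feature."""
    total = len(target)
    groups = {}
    for value, label in zip(feature, target):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the target within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def pick_by_information_gain(name_a, feature_a, name_b, feature_b, target):
    """Return the name of the feature with the higher IG (a tie keeps the first)."""
    ig_a = information_gain(feature_a, target)
    ig_b = information_gain(feature_b, target)
    return name_a if ig_a >= ig_b else name_b


# Toy data, invented for illustration only.
target = ["yes", "yes", "no", "no", "yes", "no"]
feat_x = [1, 1, 0, 0, 1, 0]  # separates the classes perfectly
feat_y = [1, 0, 1, 0, 1, 0]  # carries little information about the class

kept = pick_by_information_gain("feat_x", feat_x, "feat_y", feat_y, target)
print("keep:", kept)
```

Note that IG does not change when a feature's values are relabeled one-to-one, so if one feature of a pair is an exact rescaling of the other (which a correlation of 1.000000 suggests, but does not prove), both get the same IG and either one can be dropped. Continuous features would need to be binned before using this sketch.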

For the threshold, you can choose whatever value suits your problem; there is no universal cutoff. To play it safe, I would start with a high value (e.g., 0.95), although you could also consider pairs around 0.94 or 0.9. You can always establish a high value first and then lower it step by step, checking the performance of your model each time.
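To make the "start high, then lower it" idea concrete, here is a small sketch in plain Python. It is one possible greedy strategy, not the only one: columns are visited in order, and a column is kept only if its absolute Pearson correlation with every already-kept column is at or below the threshold. The function names, the columns `feat_c` and `feat_d`, and all values are invented for illustration; the sketch also assumes no column is constant (otherwise the correlation is undefined). With pandas you would normally get the matrix from `DataFrame.corr()` instead of the hand-written `pearson`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant


def drop_correlated(columns, threshold):
    """Greedy filter: keep a column only if |r| <= threshold against all kept columns."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data, invented for illustration only.
columns = {
    "pdays": [3, 6, 9, 12, 15, 18],
    "pmonths": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # exact rescaling of pdays
    "feat_c": [1, 4, 2, 5, 3, 6],               # moderately correlated with pdays
    "feat_d": [0, 1, 0, 2, 1, 0],               # weakly correlated with the rest
}

# Start with a strict (high) threshold, then lower it and compare the results.
for threshold in (0.95, 0.90, 0.65):
    print(threshold, drop_correlated(columns, threshold))
```

The result depends on column order, since the first column of each correlated pair is the one that survives. In practice you would retrain and evaluate the model on each reduced feature set and keep the threshold that performs best.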

[TAL14] Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review, pages 37–64. CRC Press, 2014.

[SIL07] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.

[BHS15] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection using Joint Mutual Information Maximisation. Expert Systems with Applications, 42(22):8520–8532, 2015.

answered Sep 30 '22 12:09 by kevin