How do we determine which variables to remove from our model based on the correlation coefficient?
See the example variables below:
Top 10 Absolute Correlations:
Variable 1       Variable 2      Correlation
pdays            pmonths         1.000000
emp.var.rate     euribor3m       0.970955
euribor3m        nr.employed     0.942545
emp.var.rate     nr.employed     0.899818
previous         pastEmail       0.798017
emp.var.rate     cons.price.idx  0.763827
cons.price.idx   euribor3m       0.670844
contact          cons.price.idx  0.585899
previous         nr.employed     0.504471
cons.price.idx   nr.employed     0.490632
[Correlation matrix heat map of independent variables]
Questions:
1) Given the correlation value calculated between two variables, how do we decide which of the two highly correlated variables to remove?
Ex: the correlation value between pdays and pmonths is 1.000000. Which variable should be removed from the model, pdays or pmonths? How is that variable determined?
2) What correlation threshold is typically used to decide whether to drop a variable? E.g., > 0.65 or > 0.90?
3) Can you please interpret the heat map above and explain which variables should be removed, and why?
Correlation Coefficient
The logic behind using correlation for feature selection is that good predictors are highly correlated with the target but uncorrelated with each other. If two features are highly correlated with each other, one can be predicted from the other, so keeping both adds little information to the model.
A correlation coefficient is a number between -1 and 1 that tells you the strength and direction of a relationship between variables. In other words, it reflects how similar the measurements of two or more variables are across a dataset.
A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. Understanding that relationship is useful because we can use the value of one variable to predict the value of the other variable.
Positive Correlation: means that if feature A increases then feature B also increases, or if feature A decreases then feature B also decreases. The two features move in tandem in a linear relationship.
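A ranking like the "Top 10 Absolute Correlations" table above can be reproduced with pandas. The sketch below uses a small synthetic DataFrame: the column names `pdays`, `pmonths`, and `euribor3m` are borrowed from the question, but the values are random stand-ins, not the real bank-marketing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; replace df with your own DataFrame.
rng = np.random.default_rng(0)
n = 200
pdays = rng.normal(size=n)
df = pd.DataFrame({
    "pdays": pdays,
    "pmonths": pdays / 30.0,          # perfectly correlated copy of pdays (r = 1.0)
    "euribor3m": rng.normal(size=n),  # unrelated noise column
})

# Absolute correlation of every unordered pair of columns, largest first.
corr = df.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs.head(10))
```

Each entry of `pairs` is indexed by the (variable 1, variable 2) tuple, exactly as in the table above.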
You could use another selection criterion for choosing between each pair of highly correlated features. For example, you can use the Information Gain (IG), which measures how much information a feature gives about the class (i.e., its reduction of entropy [TAL14], [SIL07]). Once you have detected a pair of highly correlated features (e.g., as you mentioned, pdays and pmonths), you can measure the IG of each variable and keep the one with the highest IG. Nevertheless, there are other selection criteria that you could apply instead of IG (e.g., Mutual Information Maximization [BHS15]).
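The "keep the member of the pair with the higher IG" idea can be sketched with a hand-rolled information-gain estimate (entropy of the class minus entropy conditioned on a binned feature). The data here is hypothetical: the target `y` is driven by `pdays`, and `pmonths` is a noisier copy, so IG should favor `pdays`.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, y, bins=10):
    """IG = H(y) - H(y | binned feature), with the feature discretized into bins."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    h_cond = 0.0
    for b in np.unique(binned):
        mask = binned == b
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

# Hypothetical data: y depends on pdays; pmonths is a noisy copy of pdays.
rng = np.random.default_rng(0)
pdays = rng.normal(size=1000)
pmonths = pdays + rng.normal(scale=0.7, size=1000)
y = (pdays > 0).astype(int)

ig = {"pdays": information_gain(pdays, y), "pmonths": information_gain(pmonths, y)}
keep = max(ig, key=ig.get)  # keep the member of the correlated pair with the higher IG
```

In practice you would compute this (or use a library estimator such as scikit-learn's `mutual_info_classif`) on your actual features and target, then drop the lower-scoring member of each correlated pair.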
For the threshold, you can choose the value you want (it depends on your problem). However, to play it safe I would select a high value (e.g., 0.95), although you could also consider values around 0.90 to 0.94. Moreover, you can always establish a high value and then lower it gradually, checking your model's performance at each step.
[TAL14] Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review, pages 37–64. CRC Press, 2014.
[SIL07] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
[BHS15] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection using Joint Mutual Information Maximisation. Expert Systems with Applications, 42(22):8520–8532, 2015.