 

Correlated features and classification accuracy

I'd like to ask a question about how correlated features (variables) affect the classification accuracy of machine learning algorithms. By correlated features I mean a correlation between the features themselves, not with the target class (e.g., the perimeter and the area of a geometric figure, or the level of education and the average income). In my opinion correlated features negatively affect the accuracy of a classification algorithm, because the correlation makes one of them redundant. Is that really the case? Does the answer depend on the type of classification algorithm? Any suggestions for papers and lectures are very welcome! Thanks

asked Feb 11 '13 by Titus Pullo


People also ask

Does correlation affect classification?

Correlated features do not affect classification accuracy per se.

What happens if two features are highly correlated?

In a least-squares fit, decompose the design matrix by SVD as X = U S Vᵀ; the variance of the least-squares weights W_ls then scales with S⁻². When the dataset contains highly correlated features, some singular values in S become small, so S⁻² becomes large and the variance of W_ls blows up. For this reason it is often advised to keep only one of two highly correlated features.
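
A minimal numpy sketch of this effect (the setup and numbers are illustrative, not from the original snippet): refitting least squares on repeatedly re-noised targets shows the smaller singular value of X shrinking and the weight variance growing by roughly the variance inflation factor 1/(1−r²), about 50 at r = 0.99.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    def weight_variance(corr):
        """Variance of the least-squares weights over many re-noised targets."""
        x1 = rng.normal(size=n)
        # Second feature with (approximately) the requested correlation to the first
        x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=n)
        X = np.column_stack([x1, x2])
        print("singular values of X:", np.linalg.svd(X, compute_uv=False))
        w_hats = [np.linalg.lstsq(X, X @ np.array([1.0, 1.0]) + rng.normal(size=n),
                                  rcond=None)[0]
                  for _ in range(500)]
        return np.var(w_hats, axis=0)

    print("uncorrelated:      ", weight_variance(0.0))
    print("highly correlated: ", weight_variance(0.99))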

What happens if features are correlated?

Positive correlation means that if feature A increases then feature B also increases, and if feature A decreases then feature B also decreases; the two features move in tandem and have a linear relationship. Negative correlation means that if feature A increases then feature B decreases, and vice versa.
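
A two-line numpy illustration of these definitions (the values are invented) using the Pearson correlation coefficient:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    print(np.corrcoef(a, 2 * a + 1)[0, 1])   # +1.0: moves in tandem (positive)
    print(np.corrcoef(a, -3 * a + 7)[0, 1])  # -1.0: moves in opposition (negative)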

Why are correlated features a problem?

The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable in isolation, because the independent variables tend to change in unison.
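
A common diagnostic for this is the variance inflation factor. A hedged sketch with statsmodels (the synthetic features, echoing the question's perimeter/area example, are made up; a square of area A has perimeter 4√A):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    area = rng.uniform(10, 100, size=200)
    perimeter = 4 * np.sqrt(area) + rng.normal(scale=0.5, size=200)  # nearly a function of area
    income = rng.normal(50, 10, size=200)                            # unrelated, for contrast

    X = sm.add_constant(pd.DataFrame({"area": area, "perimeter": perimeter,
                                      "income": income}))
    for i, col in enumerate(X.columns[1:], start=1):  # skip the intercept column
        # A VIF well above ~10 flags a feature largely explained by the others
        print(col, variance_inflation_factor(X.values, i))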


1 Answer

Correlated features do not affect classification accuracy per se. The problem in realistic situations is that we have a finite number of training examples with which to train a classifier. For a fixed number of training examples, increasing the number of features typically increases classification accuracy up to a point, but as the number of features continues to increase, classification accuracy will eventually decrease, because we are then undersampled relative to the large number of features. To learn more about the implications of this, look up the curse of dimensionality.
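
A small scikit-learn simulation of this undersampling effect (the dataset and classifier choices are mine, not the answer's): the informative signal is held fixed at two dimensions while uninformative features are added, and cross-validated accuracy of a nearest-neighbour classifier falls off as the feature count grows.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    for n_features in [2, 10, 50, 200]:
        X, y = make_classification(n_samples=100, n_features=n_features,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
        acc = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
        print(f"{n_features:4d} features -> accuracy {acc:.3f}")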

If two numerical features are perfectly correlated, then one adds no additional information (it is determined by the other). So if the number of features is too high relative to the training sample size, it is beneficial to reduce the number of features through a feature extraction technique (e.g., principal component analysis).
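
A minimal PCA sketch along those lines with scikit-learn (synthetic data; the 99% variance threshold is an arbitrary choice): the nearly duplicated feature collapses into a single component.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=300)
    X = np.column_stack([x1,
                         2 * x1 + 0.01 * rng.normal(size=300),  # near-duplicate of x1
                         rng.normal(size=300)])                 # independent feature

    pca = PCA(n_components=0.99)            # keep enough components for 99% of variance
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)   # 3 features -> 2 components
    print(pca.explained_variance_ratio_)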

The effect of correlation does depend on the type of classifier. Some nonparametric classifiers are less sensitive to correlation of variables (although training time will likely increase with an increase in the number of features). For statistical methods such as Gaussian maximum likelihood, having too many correlated features relative to the training sample size will render the classifier unusable in the original feature space (the covariance matrix of the sample data becomes singular).
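
The singular-covariance failure mode is easy to reproduce with numpy (a toy sketch in which one feature is an exact multiple of another):

    import numpy as np

    rng = np.random.default_rng(3)
    x1 = rng.normal(size=50)
    X = np.column_stack([x1, 3.0 * x1])          # second feature exactly determined by the first
    cov = np.cov(X, rowvar=False)

    print("rank:", np.linalg.matrix_rank(cov))   # 1 (not 2): the covariance matrix is singular
    print("determinant:", np.linalg.det(cov))    # ~0, so the inverse required by the
                                                 # Gaussian likelihood is not computable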

answered Oct 08 '22 by bogatron