
Scale before PCA

I'm using PCA from scikit-learn and I'm trying to interpret the results I get, which raised a question: should I subtract the mean (or standardize the data) before using PCA, or is this already built into the sklearn implementation?

Moreover, if preprocessing is needed, which of the two should I perform, and why is this step needed?

Kobe-Wan Kenobi asked Dec 19 '22 13:12



2 Answers

I will try to explain it with an example. Suppose you have a dataset with many features about housing, and your goal is to classify whether a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation) and some float or integer variables (e.g. market price, number of bedrooms).

The first thing you might do is encode the categorical variables. For instance, if you have 100 locations in your dataset, a common approach is to encode them as integers from 0 to 99. You might instead use one-hot encoding (i.e. a column of 1s and 0s for each location), depending on the classifier you plan to use.

Now, if the price is recorded in dollars, with values in the millions, the price feature will have a much higher variance, and thus a much higher standard deviation, than the other features. Remember that variance is computed from the squared differences from the mean: a larger scale produces larger differences, and squaring makes them grow even faster. That does not mean the price carries significantly more information than, for instance, the location. In this example, however, PCA would give a very high weight to the price feature, and the weights of the categorical features might drop almost to 0.

If you normalize your features, they are compared on an equal footing in terms of explained variance. So it is good practice to center the features (subtract the mean) and scale them before using PCA.
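To make the effect concrete, here is a minimal sketch, assuming synthetic data: the feature names, ranges, and sample size below are made up for illustration and are not from a real housing dataset. It compares the explained variance ratio of PCA on raw features with PCA on features standardized by `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical, independently generated features on very different scales.
price = rng.normal(500_000, 150_000, n)            # market price in dollars
bedrooms = rng.integers(1, 6, n).astype(float)     # 1 to 5 bedrooms
location = rng.integers(0, 100, n).astype(float)   # label-encoded, 0 to 99
X = np.column_stack([price, bedrooms, location])

# PCA on the raw features: the large-scale price column dominates.
raw_ratio = PCA(n_components=3).fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA(n_components=3).fit(X_std).explained_variance_ratio_

print("unscaled:    ", raw_ratio.round(4))
print("standardized:", std_ratio.round(4))
```

On the raw data the first component should explain almost all of the variance, simply because price is measured in much larger units. After standardization the three components should share the variance far more evenly, since the synthetic features are independent.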

MhFarahani answered Dec 26 '22 11:12



Before PCA, you should:

  1. Mean normalize (ALWAYS)

  2. Scale the features (if required)

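Note: Please remember that steps 1 and 2 are technically not the same. Step 1 subtracts each feature's mean; step 2 divides each feature by a measure of its spread, such as its standard deviation.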
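The two steps can be sketched as follows, assuming synthetic two-feature data (the means and spreads below are arbitrary). The sketch also checks the behavior described in the scikit-learn documentation, namely that `PCA` centers the input itself but does not scale it, so fitting on raw and on manually centered data should give the same explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two hypothetical features with different means and spreads.
X = rng.normal(loc=[10.0, -5.0], scale=[1.0, 20.0], size=(200, 2))

# Step 1: mean normalization (centering) only shifts each feature to mean 0.
X_centered = X - X.mean(axis=0)

# Step 2: scaling divides each centered feature by its standard deviation.
X_scaled = X_centered / X.std(axis=0)

# PCA subtracts the mean internally, so manual centering changes nothing here.
pca_raw = PCA(n_components=2).fit(X)
pca_centered = PCA(n_components=2).fit(X_centered)
same_variance = np.allclose(
    pca_raw.explained_variance_, pca_centered.explained_variance_
)

print("PCA stored mean:", pca_raw.mean_)
print("centering made no difference:", same_variance)
print("std after step 1:", X_centered.std(axis=0))
print("std after step 2:", X_scaled.std(axis=0))
```

If this holds on your version of scikit-learn, step 1 is effectively done for you, while step 2 (for example with `StandardScaler`) remains your decision.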

Phoenix answered Dec 26 '22 12:12
