
PCA first or normalization first?

When doing regression or classification, what is the correct (or better) way to preprocess the data?

  1. Normalize the data -> PCA -> training
  2. PCA -> normalize PCA output -> training
  3. Normalize the data -> PCA -> normalize PCA output -> training

Which of the above is more correct, or is the "standard" way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.

AlanS asked Apr 12 '12 08:04



1 Answer

You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;

If I now perform PCA, I correctly find that the principal components (the rows of the weights matrix) are oriented at an angle to the coordinate axes:

>> wts = pca(X)
wts =
    0.6659    0.7461
   -0.7461    0.6659

If I now scale the first feature of the data set by 100, intuition suggests that the principal components shouldn't change:

>> Y = X;
>> Y(:,1) = 100 * Y(:,1);

However, we now find that the principal components are aligned with the coordinate axes:

>> wts = pca(Y)
wts =
    1.0000    0.0056
   -0.0056    1.0000
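
A short worked calculation suggests why the axes take over. This is only a sketch: it uses the population covariance C in place of the sample covariance of the 100 random points, and S is my own notation for the scaling matrix, not something from the answer.

```latex
\[
S = \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\operatorname{Cov}(Y) = S\,C\,S = \begin{pmatrix} 10000 & 50 \\ 50 & 1 \end{pmatrix}
\]
\[
\lambda_{1,2} = \frac{10001 \pm \sqrt{10001^{2} - 4 \cdot 7500}}{2}
\approx 10000.25,\ 0.75
\]
\[
v_1 \propto \begin{pmatrix} 1 \\ (\lambda_1 - 10000)/50 \end{pmatrix}
\approx \begin{pmatrix} 1 \\ 0.005 \end{pmatrix}
\]
```

The leading direction sits within about 0.3 degrees of the first axis, in line with the 0.0056 entry in the sample output above: the scaled feature carries almost all of the variance, so it dominates the first component.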

To resolve this, there are two options. First, I could rescale the data:

>> Ynorm = bsxfun(@rdivide,Y,std(Y)) 

(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab. All I'm doing is dividing each feature by its standard deviation.)

We now get sensible results from PCA:

>> wts = pca(Ynorm)
wts =
   -0.7125   -0.7016
    0.7016   -0.7125

They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.
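
To see the effect of rescaling without sampling noise, here is a minimal sketch that works with the population covariance matrix directly instead of random data. It assumes only base MATLAB/Octave functions (eig, diag, sqrt, max). The names S, covY, sd, covYnorm, leadY and leadYnorm are mine, and eig on a covariance matrix stands in for the pca routine used above, whose implementation is not shown.

```matlab
% Population-level sketch: no random data, only base MATLAB/Octave calls.
% S, covY, sd, covYnorm, leadY, leadYnorm are illustrative names (assumptions).
C = [1 0.5; 0.5 1];            % covariance of X, as defined in the answer
S = diag([100 1]);             % scale the first feature by 100
covY = S * C * S;              % covariance of Y = X * S

% Principal directions are the eigenvectors of the covariance matrix.
[Vy, Dy] = eig(covY);
[~, k] = max(diag(Dy));        % index of the largest eigenvalue
leadY = Vy(:, k);              % leading direction of the scaled data

% Dividing each feature by its standard deviation rescales the covariance.
sd = sqrt(diag(covY));         % column vector of standard deviations
covYnorm = covY ./ (sd * sd'); % covariance after dividing by the std
[Vn, Dn] = eig(covYnorm);
[~, k] = max(diag(Dn));
leadYnorm = Vn(:, k);          % leading direction after rescaling

disp(leadY');                  % nearly aligned with the first axis
disp(leadYnorm');              % at 45 degrees to both axes
```

With sample data the entries will differ somewhat from these population values, as in the outputs shown in the answer.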

The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:

>> wts = pca(Y,'corr')
wts =
    0.7071    0.7071
   -0.7071    0.7071

In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
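
The equivalence can be sketched in a few lines. The notation here is my own, not the answer's: x̄_j and s_j are the sample mean and standard deviation of feature j, Σ is the sample covariance matrix, D = diag(s_1, ..., s_p), and R is the correlation matrix.

```latex
\[
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
\quad\Longrightarrow\quad
\operatorname{Cov}(Z) = D^{-1}\,\Sigma\,D^{-1} = R,
\qquad
R_{jk} = \frac{\Sigma_{jk}}{s_j\,s_k}
\]
\[
\text{For two features: } R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\quad
\lambda = 1 \pm r,
\quad
v = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ \pm 1 \end{pmatrix}
\]
```

So PCA on the covariance of the standardized data is PCA on the correlation matrix of the original data. The two-feature case also explains why the entries above are exactly ±0.7071: for any nonzero correlation r, the eigenvectors of a 2-by-2 correlation matrix lie along the two diagonals.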

Chris Taylor answered Sep 21 '22 13:09