I have a set of 70 input variables on which I need to perform PCA. As per my understanding, standardizing the data so that each input variable has mean 0 and variance 1 is necessary before applying PCA.

I am having a hard time figuring out whether I need to apply preprocessing.StandardScaler() before passing my data set to PCA, or whether sklearn's PCA function does it on its own. If the latter is the case, then the explained_variance_ratio_ should be the same whether or not I apply preprocessing.StandardScaler(). But the results are different, hence I believe preprocessing.StandardScaler() is necessary before applying PCA. Is this true?
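The discrepancy described above can be reproduced with a small sketch; the data here is synthetic (shapes and scales are assumed purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: five features, one of them on a much larger scale
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([1.0, 1.0, 100.0, 1.0, 1.0])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(pca_raw.explained_variance_ratio_)     # dominated by the large-scale feature
print(pca_scaled.explained_variance_ratio_)  # variance spread across components
```

The two ratios differ, which is exactly the observation that prompted the question.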
Hello! Yes, it is necessary to normalize the data before performing PCA. PCA computes a new projection of your data set, and the new axes are based on the variance of your variables.

Before PCA, we standardize/normalize the data. Usually, normalization is done so that all features are on the same scale. For example, a housing price prediction dataset has features measured in very different units.
Scaling (what I would call centering and scaling) is very important for PCA because of the way the principal components are calculated.
Normalization is important in PCA because PCA is a variance-maximizing exercise: it projects your original data onto the directions that maximize the variance. Without normalization, a feature measured on a large scale contributes a disproportionately large variance and dominates the leading principal components.
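A quick sketch of that domination effect, using made-up data where one feature is simply expressed in much larger units than the others:

```python
import numpy as np
from sklearn.decomposition import PCA

# Three independent features; the third is in units ~50x larger
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 2] *= 50.0

pca = PCA().fit(X)
# The first principal component points almost entirely along feature 2,
# simply because that feature has the largest variance.
print(np.abs(np.round(pca.components_[0], 3)))
```

After standardizing, no single feature would be able to hijack the first component this way.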
Yes, it's true: scikit-learn's PCA does not apply standardization to the input dataset; it only centers it by subtracting the mean.
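That claim is easy to check empirically. In this sketch (toy data, assumed scales), centering by hand changes nothing, while standardizing does:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data with deliberately different feature scales and a nonzero mean
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * np.array([1.0, 5.0, 0.1, 2.0]) + 3.0

# Centering by hand changes nothing: PCA already subtracts the mean internally
pca = PCA().fit(X)
pca_centered = PCA().fit(X - X.mean(axis=0))
assert np.allclose(pca.explained_variance_ratio_,
                   pca_centered.explained_variance_ratio_)

# Standardizing does change the result: PCA does not rescale to unit variance
pca_std = PCA().fit(StandardScaler().fit_transform(X))
assert not np.allclose(pca.explained_variance_ratio_,
                       pca_std.explained_variance_ratio_)
```

So if you want PCA on the correlation matrix rather than the covariance matrix, you must apply StandardScaler (or equivalent) yourself first.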
See also this post.