When doing regression or classification, what is the correct (or better) way to preprocess the data?

1. Normalize the data -> PCA -> training
2. PCA -> normalize PCA output -> training
3. Normalize the data -> PCA -> normalize PCA output -> training

Which of the above is more correct, or is the "standardized" way to preprocess the data? By "normalize" I mean either standardization, linear scaling or some other techniques.


PCA first or normalization first?

1 Answer

You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;
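As a quick sanity check on this construction, the sample correlation of the generated data should come out close to C. The sketch below uses only base functions (chol, randn, corrcoef); the larger sample size n is my own choice so that the estimate is tight, and is not part of the original example.

```matlab
% Sketch: check that randn(n,2) * chol(C) has correlation close to C.
% The sample size n is an illustrative assumption.
C = [1 0.5; 0.5 1];           % target correlation matrix
A = chol(C);                  % upper-triangular factor, A' * A equals C
n = 100000;                   % large sample so the estimate is tight
X = randn(n, 2) * A;          % rows of X have covariance A' * A = C
R = corrcoef(X);              % sample correlation matrix of the columns
disp(R)                       % off-diagonal entries should be near 0.5
```

Results vary from run to run because no random seed is set.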

If I now perform PCA, I correctly find that the principal components (the rows of the weights matrix) are oriented at an angle to the coordinate axes:

>> wts=pca(X)
wts =
    0.6659    0.7461
   -0.7461    0.6659

If I now scale the first feature of the data set by 100, intuitively we think that the principal components shouldn't change:

>> Y = X;
>> Y(:,1) = 100 * Y(:,1);

However, we now find that the principal components are aligned with the coordinate axes:

>> wts=pca(Y)
wts =
    1.0000    0.0056
   -0.0056    1.0000
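One way to see why the components snap to the axes: multiplying a feature by 100 multiplies its variance by 10,000, so that feature dominates the covariance matrix whose eigenvectors define the principal directions. A minimal sketch using the base functions cov and eig (not the pca call above, whose internals I am not assuming), with a sample size of 1000 chosen for illustration:

```matlab
% Sketch: effect of rescaling one feature on the covariance eigenvectors.
C = [1 0.5; 0.5 1];
X = randn(1000, 2) * chol(C);     % correlated data, comparable scales
Y = X;
Y(:,1) = 100 * Y(:,1);            % first feature scaled by 100

SX = cov(X);                      % 2x2 sample covariance of X
SY = cov(Y);                      % SY(1,1) is 100^2 times SX(1,1)

[VX, DX] = eig(SX);               % columns of VX are eigenvectors of SX
[VY, DY] = eig(SY);               % columns of VY are eigenvectors of SY

% Pick the eigenvector with the largest eigenvalue (first principal axis).
[~, ix] = max(diag(DX));
[~, iy] = max(diag(DY));
disp(VX(:, ix)')                  % at an angle to both coordinate axes
disp(VY(:, iy)')                  % almost exactly along the first axis
```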

To resolve this, there are two options. First, I could rescale the data:

>> Ynorm = bsxfun(@rdivide,Y,std(Y))

(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab - all I'm doing is dividing each feature by its standard deviation; this line does not subtract the mean).
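A sketch of full standardization, with the mean subtraction written out explicitly, using the same bsxfun idiom. The names Yc and Yz are my own, and the data is regenerated so that the snippet runs on its own.

```matlab
% Sketch: z-score each column with bsxfun (mean removal, then scaling).
X = randn(100, 2) * chol([1 0.5; 0.5 1]);
Y = X;
Y(:,1) = 100 * Y(:,1);

Yc = bsxfun(@minus, Y, mean(Y));      % subtract the column means
Yz = bsxfun(@rdivide, Yc, std(Y));    % divide by the column standard deviations

disp(mean(Yz))                        % both entries are zero up to rounding
disp(std(Yz))                         % both entries are 1 up to rounding
```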

We now get sensible results from PCA:

>> wts = pca(Ynorm)
wts =
   -0.7125   -0.7016
    0.7016   -0.7125

They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.

The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:

>> wts = pca(Y,'corr')
wts =
    0.7071    0.7071
   -0.7071    0.7071

In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
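The equivalence can be checked numerically with base functions: the covariance matrix of fully standardized data is the correlation matrix of the original data, so both give the same eigenvectors, and the feature scaling drops out. This is a sketch using cov, corrcoef and eig rather than the pca call above; the variable names are my own.

```matlab
% Sketch: covariance of standardized data equals correlation of raw data.
X = randn(100, 2) * chol([1 0.5; 0.5 1]);
Y = X;
Y(:,1) = 100 * Y(:,1);                          % badly scaled copy of X

Yz = bsxfun(@rdivide, bsxfun(@minus, Y, mean(Y)), std(Y));

Sz = cov(Yz);                                   % covariance after standardizing
Ry = corrcoef(Y);                               % correlation of the scaled data
Rx = corrcoef(X);                               % correlation of the original data

disp(max(abs(Sz(:) - Ry(:))))                   % zero up to rounding
disp(max(abs(Rx(:) - Ry(:))))                   % zero up to rounding

[V, D] = eig(Ry);                               % columns of V are the principal axes
disp(abs(V))                                    % every entry is about 0.7071
```

With only two features the correlation matrix has the form [1 r; r 1], whose eigenvectors are always along the two diagonals, which is why the 'corr' result above is exactly 0.7071 in magnitude.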


answered Sep 21 '22 13:09

Chris Taylor


AlanS
