How do I use principal component analysis in supervised machine learning classification problems?

I have been working through the concepts of principal component analysis in R.

I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting the first few, most interesting principal components as numeric variables from my matrix.

The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.

More specifically, here's my real question:

I respect that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA as a way to reduce the dimensionality of a dataset, and THEN use these components with a supervised learner, say, an SVM?

I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" and "No" I haven't come across!).

Please step in and set me straight if you have the time and wherewithal. Thanks in advance.

asked Nov 28 '13 by tumultous_rooster



2 Answers

Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in the same shoes and had to hunt down the answer.

The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below (W' denotes the transpose of W):

X = ZW'

Because of orthonormality, we can invert W simply by transposing it and write:

XW = Z
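
As a concrete sketch in R (using prcomp on the built-in iris measurements purely for illustration), you can verify both identities numerically:

    # PCA of a centered numeric matrix; prcomp returns the basis W and the scores Z
    X   <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
    pca <- prcomp(X, center = FALSE)   # data already centered above

    W <- pca$rotation   # orthonormal basis W (p x p); columns are the principal directions
    Z <- pca$x          # coordinates of each row of X in that basis (n x p)

    max(abs(X - Z %*% t(W)))   # X  = ZW'  (zero up to floating-point error)
    max(abs(X %*% W - Z))      # XW = Z    (zero up to floating-point error)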

Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest eigenvalue (i.e., the eigenvector corresponding to the largest eigenvalue comes first), this amounts to simply keeping the first k columns of W; call that truncated matrix W_k:

XW_k = Z_k

We now have a k-dimensional representation of our training data X. From here, you run some supervised classifier using the new features in Z_k:

Y = f(Z_k)
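
In R this whole training step might look roughly like the following sketch (train_x / train_y are hypothetical training features and labels, k is your choice, and the SVM from the e1071 package is just one possible f):

    library(e1071)   # provides svm(); any supervised classifier would do here

    pca     <- prcomp(train_x, center = TRUE, scale. = TRUE)  # learn W from training data only
    k       <- 10                                             # number of components to keep
    train_z <- pca$x[, 1:k]                                   # Z_k: first k columns of the scores

    # The labels never pass through PCA; they simply stay attached to their rows
    model <- svm(x = train_z, y = as.factor(train_y))         # Y = f(Z_k)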

The key is to realize that W_k is, in some sense, a canonical transformation from our space of p features down to a space of k features (or at least the best such transformation we could find using our training data). Thus, we can hit our test data X_test with the same W_k transformation, resulting in a k-dimensional set of test features:

X_test W_k = Z_test

We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:

Y_test = f(Z_test)
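
Continuing the same hypothetical sketch, predict.prcomp applies the centering, scaling, and rotation learned from the training data to test_x:

    test_z <- predict(pca, newdata = test_x)[, 1:k]   # Z_test = X_test W_k
    pred   <- predict(model, test_z)                  # Y_test = f(Z_test)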

The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them carry a meaningful signal, and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have enough memory to handle the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating the features that truly add value.

answered by Alex P. Miller

After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
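
In R, that could look roughly like this (a sketch; assume pca is a prcomp fit on the training portion, k is the number of components kept, and new_x is a hypothetical matrix of new observations):

    # Apply the training set's centering, scaling, and rotation to any new data point
    project <- function(newdata, pca, k) {
      scaled <- scale(newdata, center = pca$center, scale = pca$scale)
      scaled %*% pca$rotation[, 1:k]
    }

    new_z <- project(new_x, pca, k)   # feed new_z, not new_x, to the trained classifier

This is essentially what predict(pca, newdata = new_x) does for a prcomp fit.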

This is useful when the intrinsic dimensionality of your data is much smaller than the number of original features, and the gain in classification performance is worth the loss in accuracy and the cost of running PCA. Also, keep in mind the limitations of PCA:

  • In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
  • Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions, in which case the classifier won't be able to learn from the transformed data (see the toy example after this list).
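
A toy illustration of that second point, with made-up data in R: the class signal sits entirely in a low-variance direction, so keeping only the top component discards it.

    set.seed(1)
    n     <- 200
    label <- sample(c(-2, 2), n, replace = TRUE)
    x1    <- rnorm(n, sd = 10)          # high variance, carries no class information
    x2    <- rnorm(n, sd = 1) + label   # low variance, but separates the classes
    X     <- cbind(x1, x2)
    y     <- factor(label)

    pca <- prcomp(X)            # unscaled, so PC1 aligns with the noisy x1 direction
    summary(pca)                # PC1 explains nearly all of the variance...
    by(pca$x[, 1], y, mean)     # ...yet the class means on PC1 barely differ
    by(pca$x[, 2], y, mean)     # the separation lives on the low-variance PC2
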
answered by Don Reba