How do I use principal component analysis in supervised machine learning classification problems?

I have been working through the concepts of principal component analysis in R.

I am comfortable with applying PCA to a (say, labeled) dataset and ultimately extracting the first few, most interesting principal components as numeric variables from my matrix.

The ultimate question is, in a sense, now what? Most of the reading I've come across on PCA immediately halts after the computations are done, especially with regards to machine learning. Pardon my hyperbole, but I feel as if everyone agrees that the technique is useful, but nobody wants to actually use it after they do it.

More specifically, here's my real question:

I respect that principal components are linear combinations of the variables you started with. So, how does this transformed data play a role in supervised machine learning? How could someone ever use PCA as a way to reduce the dimensionality of a dataset, and THEN use these components with a supervised learner, say, an SVM?

I'm absolutely confused about what happens to our labels. Once we are in eigenspace, great. But I don't see any way to continue to move forward with machine learning if this transformation blows apart our concept of classification (unless there's some linear combination of "Yes" and "No" I haven't come across!).

Please step in and set me straight if you have the time and wherewithal. Thanks in advance.

asked Nov 28 '13 by tumultous_rooster



2 Answers

Old question, but I don't think it's been satisfactorily answered (and I just landed here myself through Google). I found myself in the same shoes and had to hunt down the answer.

The goal of PCA is to represent your data X in an orthonormal basis W; the coordinates of your data in this new basis are Z, as expressed below (W' denotes the transpose of W):

X = ZW'

Because of orthonormality, we can invert W simply by transposing it and write:

XW = Z
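
As a concrete sketch in R (using prcomp on the built-in iris measurements purely for illustration), you can verify both identities numerically:

    # PCA of a centered numeric matrix; prcomp returns the basis W and the scores Z
    X   <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
    pca <- prcomp(X, center = FALSE)   # data already centered above

    W <- pca$rotation   # orthonormal basis W (p x p); columns are the principal directions
    Z <- pca$x          # coordinates of each row of X in that basis (n x p)

    max(abs(X - Z %*% t(W)))   # X  = ZW'  (zero up to floating-point error)
    max(abs(X %*% W - Z))      # XW = Z    (zero up to floating-point error)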

Now to reduce dimensionality, let's pick some number of components k < p. Assuming our basis vectors in W are ordered from largest to smallest eigenvalue (i.e., the eigenvector corresponding to the largest eigenvalue comes first), this amounts to simply keeping the first k columns of W; call that truncated matrix W_k:

XW_k = Z_k

We now have a k-dimensional representation of our training data X. From here, you run some supervised classifier using the new features in Z_k:

Y = f(Z_k)
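
In R this whole training step might look roughly like the following sketch (train_x / train_y are hypothetical training features and labels, k is your choice, and the SVM from the e1071 package is just one possible f):

    library(e1071)   # provides svm(); any supervised classifier would do here

    pca     <- prcomp(train_x, center = TRUE, scale. = TRUE)  # learn W from training data only
    k       <- 10                                             # number of components to keep
    train_z <- pca$x[, 1:k]                                   # Z_k: first k columns of the scores

    # The labels never pass through PCA; they simply stay attached to their rows
    model <- svm(x = train_z, y = as.factor(train_y))         # Y = f(Z_k)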

The key is to realize that W_k is, in some sense, a canonical transformation from our space of p features down to a space of k features (or at least the best such transformation we could find using our training data). Thus, we can hit our test data X_test with the same W_k transformation, resulting in a k-dimensional set of test features:

X_test W_k = Z_test

We can now use the same classifier trained on the k-dimensional representation of our training data to make predictions on the k-dimensional representation of our test data:

Y_test = f(Z_test)
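
Continuing the same hypothetical sketch, predict.prcomp applies the centering, scaling, and rotation learned from the training data to test_x:

    test_z <- predict(pca, newdata = test_x)[, 1:k]   # Z_test = X_test W_k
    pred   <- predict(model, test_z)                  # Y_test = f(Z_test)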

The point of going through this whole procedure is that you may have thousands of features, but (1) not all of them carry a meaningful signal, and (2) your supervised learning method may be far too complex to train on the full feature set (either it would take too long or your computer wouldn't have enough memory to handle the calculations). PCA allows you to dramatically reduce the number of features it takes to represent your data without eliminating the features that truly add value.

answered by Alex P. Miller

After you have used PCA on a portion of your data to compute the transformation matrix, you apply that matrix to each of your data points before submitting them to your classifier.
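
In R, that could look roughly like this (a sketch; assume pca is a prcomp fit on the training portion, k is the number of components kept, and new_x is a hypothetical matrix of new observations):

    # Apply the training set's centering, scaling, and rotation to any new data point
    project <- function(newdata, pca, k) {
      scaled <- scale(newdata, center = pca$center, scale = pca$scale)
      scaled %*% pca$rotation[, 1:k]
    }

    new_z <- project(new_x, pca, k)   # feed new_z, not new_x, to the trained classifier

This is essentially what predict(pca, newdata = new_x) does for a prcomp fit.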

This is useful when the intrinsic dimensionality of your data is much smaller than the number of original features, and the gain in classification performance is worth the loss in accuracy and the cost of running PCA. Also, keep in mind the limitations of PCA:

  • In performing a linear transformation, you implicitly assume that all components are expressed in equivalent units.
  • Beyond variance, PCA is blind to the structure of your data. It may very well happen that the data splits along low-variance dimensions, in which case the classifier won't be able to learn from the transformed data (see the toy example after this list).
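
A toy illustration of that second point, with made-up data in R: the class signal sits entirely in a low-variance direction, so keeping only the top component discards it.

    set.seed(1)
    n     <- 200
    label <- sample(c(-2, 2), n, replace = TRUE)
    x1    <- rnorm(n, sd = 10)          # high variance, carries no class information
    x2    <- rnorm(n, sd = 1) + label   # low variance, but separates the classes
    X     <- cbind(x1, x2)
    y     <- factor(label)

    pca <- prcomp(X)            # unscaled, so PC1 aligns with the noisy x1 direction
    summary(pca)                # PC1 explains nearly all of the variance...
    by(pca$x[, 1], y, mean)     # ...yet the class means on PC1 barely differ
    by(pca$x[, 2], y, mean)     # the separation lives on the low-variance PC2
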
answered by Don Reba