I'm trying to apply Linear Discriminant Analysis to a two-class problem. As far as I understand, LDA assumes that both classes share the same covariance matrix, and then models each class-conditional likelihood as a Gaussian distribution with its own mean.
Another classifier I have tried is naive Bayes, which disregards any correlation between the predictor variables.
Now, I don't understand what happens when I apply PCA to the dataset. By its very definition, the PCA algorithm rotates the data such that the covariance matrix is diagonal. Nothing is lost in the rotation, but since the covariance matrix is now diagonal, shouldn't naive Bayes be just as good as LDA, or even better, since LDA has many more parameters to estimate? Yet, on my data, naive Bayes is outperformed by LDA, with or without PCA.
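For reference, this is roughly the comparison I am running (a minimal sketch assuming scikit-learn; the random `X` and `y` are just placeholders for my actual dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))      # placeholder features
y = rng.integers(0, 2, size=500)    # placeholder binary labels

# Rotate onto all principal components, so no information is discarded.
X_pca = PCA(n_components=X.shape[1]).fit_transform(X)

for name, Z in [("original", X), ("PCA-rotated", X_pca)]:
    lda_acc = cross_val_score(LinearDiscriminantAnalysis(), Z, y, cv=5).mean()
    nb_acc = cross_val_score(GaussianNB(), Z, y, cv=5).mean()
    print(f"{name:12s}  LDA accuracy: {lda_acc:.3f}  NB accuracy: {nb_acc:.3f}")
```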
Indeed, the data is exactly the same as long as I use all the principal components, which tells me the result should be exactly what it is. Yet the covariance matrix is now diagonal... brain meltdown
Can somebody explain this to me? I hope that I have phrased my question clearly enough. Thank you!
PCA is useful for dimensionality reduction (e.g., if the size of your training set is too small for the number of dimensions of the data). But if you are using all of the principal components, PCA won't improve the results of your linear classifier - if your classes weren't linearly separable in the original data space, then rotating your coordinates via PCA won't change that.
The other point to note regarding PCA is that it calculates the coordinate rotation from statistics of the entire set of samples, without regard to class labels. Even though your classes may be perfectly linearly separable (both before and after PCA), that doesn't imply that they will be separable independently by each of your transformed coordinates (i.e., the linear decision boundary generally isn't parallel to any of your transformed coordinate axes). In that situation I would expect naive Bayes to yield lower accuracy, since by definition it treats each axis independently.
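To make that concrete, here is a small sketch (assuming NumPy and scikit-learn, on synthetic data): after a PCA rotation the pooled covariance is roughly diagonal, but the covariance within a single class, which is what Gaussian naive Bayes assumes to be diagonal, still has off-diagonal terms.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
within = np.array([[1.0, 0.8],
                   [0.8, 1.0]])                              # shared within-class covariance
X0 = rng.multivariate_normal([0.0, 0.0], within, size=500)   # class 0
X1 = rng.multivariate_normal([3.0, 0.0], within, size=500)   # class 1, shifted mean
X = np.vstack([X0, X1])

X_pca = PCA().fit_transform(X)

print(np.round(np.cov(X_pca.T), 2))        # pooled covariance: ~diagonal
print(np.round(np.cov(X_pca[:500].T), 2))  # class-0 covariance: off-diagonal terms remain
```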
As far as I know, PCA by itself does not eliminate the dependencies between the variables; it just finds a reparameterization of the data. If you then discarded all but a few top-variance dimensions (which is known as dimensionality reduction), this might help. I think, however, that in a setting with only two classes and many interdependent features, using a discriminative classifier might work better than PCA + naive Bayes.
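One way to test that suggestion (a sketch assuming scikit-learn; the data and the number of retained components are placeholders) is to compare PCA followed by naive Bayes against a discriminative linear model such as logistic regression:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))      # placeholder for the real features
y = rng.integers(0, 2, size=500)    # placeholder binary labels

# Keep only the top few principal components (dimensionality reduction) before naive Bayes.
pca_nb = make_pipeline(PCA(n_components=3), GaussianNB())
logreg = LogisticRegression(max_iter=1000)  # discriminative linear baseline

print("PCA + naive Bayes  :", cross_val_score(pca_nb, X, y, cv=5).mean())
print("Logistic regression:", cross_val_score(logreg, X, y, cv=5).mean())
```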