 

PCA in matlab selecting top n components

I want to select the top N = 10,000 principal components of a matrix. After the PCA is completed, MATLAB should return a p x p matrix, but it doesn't!

>> size(train_data)
ans =
         400      153600

>> [coefs,scores,variances] = pca(train_data);
>> size(coefs)
ans =
      153600         399

>> size(scores)
ans =
   400   399
>> size(variances)
ans =
    399     1

Shouldn't coefs be 153600 x 153600 and scores be 400 x 153600?

When I use the code below, it gives me an Out of Memory error:

>> [V D] = eig(cov(train_data));
Out of memory. Type HELP MEMORY for your options.

Error in cov (line 96)
    xy = (xc' * xc) / (m-1);

I don't understand why MATLAB returns lower-dimensional matrices. I would expect pca to hit the same memory limit: 153600*153600*8 bytes = 188 GB.
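Just to show where that estimate comes from (plain arithmetic, nothing MATLAB-specific):

p = 153600;
covBytes = p * p * 8;                 % full p-by-p covariance matrix of doubles
fprintf('%.1f GB\n', covBytes/1e9)    % prints roughly 188.7 GB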

Error with eigs:

>> eigs(cov(train_data));
Out of memory. Type HELP MEMORY for your options.

Error in cov (line 96)
    xy = (xc' * xc) / (m-1);
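For reference, the components pca does return can be computed without ever forming the 153600-by-153600 covariance matrix, for example with an economy-size SVD of the centered data. This is only a minimal sketch, and I am not claiming it is exactly what pca does internally:

Xc = bsxfun(@minus, train_data, mean(train_data, 1));  % center each feature
[U, S, V] = svd(Xc, 'econ');          % V is 153600-by-400, S is 400-by-400
coefs  = V;                           % principal component directions
scores = U * S;                       % observations projected onto them
variances = diag(S).^2 / (size(Xc, 1) - 1);
% after centering the last singular value is ~0, which is why pca keeps only 399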
Abhishek Bhatia asked Oct 15 '15



2 Answers

Foreword

I think you are falling prey to the XY problem: trying to find 153,600 dimensions in your data is completely non-physical, so please ask about the actual problem (X) rather than your proposed solution (Y) in order to get a meaningful answer. I will use this post only to explain why PCA is not a good fit in this case. I cannot tell you what will solve your problem, since you have not told us what it is.

This is a mathematically unsound problem, as I will try to explain here.

PCA

PCA is, as user3149915 said, a way to reduce dimensions. This means that somewhere in your problem you have one-hundred-fifty-three-thousand-six-hundred dimensions floating around. That's a lot. A heck of a lot. Explaining a physical reason for the existence of all of them might be a bigger problem than trying to solve the mathematical problem.

Trying to fit that many dimensions to only 400 observations will not work. Even if all observations are linearly independent vectors in your feature space, you can extract at most 399 dimensions; the rest simply cannot be found, because there are no observations to pin them down. You can fit at most N-1 unique dimensions through N points; the remaining dimensions have an infinite number of possible locations. It is like trying to fit a plane through two points: you can fit a line through them, and the third dimension will be perpendicular to that line, but it is undefined in the rotational direction. Hence, you are left with an infinite number of possible planes that fit through those two points.
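You can see this directly in MATLAB with a toy example (made-up sizes, purely for illustration): after mean-centering, the data from N observations has rank at most N-1, no matter how many features you add.

X  = rand(5, 100);                     % 5 observations, 100 features
Xc = bsxfun(@minus, X, mean(X, 1));    % mean-centre each feature
rank(Xc)                               % returns 4, i.e. N-1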

After the first 399 components, there are no dimensions left; you are fitting a void after that. You used all your observations to find those dimensions and cannot create more. Impossible. All you can do is get more observations, some 1.5 million, and do the PCA again.

More observations than dimensions

Why do you need more observations than dimensions, you might ask? Easy: you cannot fit a unique line through a single point, nor a unique plane through two points, nor a unique 153,600-dimensional hyperplane through 400 points.

So, if I get 153,600 observations I'm set?

Sadly, no. If you have two points and fit a line through them, you get a 100% fit. No error, yay! Done for the day, let's go home and watch TV! Sadly, your boss will call you in the next morning, because your fit is rubbish. Why? Well, if you had, for instance, 20 points scattered around, the fit would not be without error, but it would be much closer to representing your actual data, since the first two points could be outliers. See this very illustrative figure, where the red points would be your first two observations:

[Figure: scattered data points with a fitted line; the two red points are the first two observations, which alone would give a misleadingly perfect fit.]
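A quick way to convince yourself of this (hypothetical numbers, just for illustration): fit a line through the first two points and through all twenty, and compare.

x = linspace(0, 10, 20).';
y = 2*x + randn(20, 1);                 % noisy "true" relationship
pTwo    = polyfit(x(1:2), y(1:2), 1);   % perfect fit to two points, zero residual
pTwenty = polyfit(x, y, 1);             % imperfect fit, but representative of the data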

If you were to extract the first 10,000 components, that would be 399 exact fits and 9,601 zero dimensions. You might as well not even attempt to calculate beyond the 399th dimension and simply pad the result with zeros up to 10,000 entries.
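In code, the "top 10,000 components" would amount to nothing more than this (a sketch, using the train_data from the question):

[coefs, scores, variances] = pca(train_data);    % at most 399 real components
k = 10000;
coefs10k     = [coefs, zeros(size(coefs, 1), k - size(coefs, 2))];
variances10k = [variances; zeros(k - numel(variances), 1)];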

TL;DR You cannot use PCA and we cannot help you solve your problem as long as you do not tell us what your problem is.

Adriaan answered Sep 28 '22

PCA is a dimensionality-reduction algorithm: it tries to reduce the number of features to principal components (PCs), each of which is a linear combination of the original features. All of this is done to reduce the dimensions of the feature space, i.e. to transform the large feature space into one that is more manageable but still retains most, if not all, of the information.
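Concretely, keeping k components gives you a smaller representation from which you can still approximately reconstruct the data. A minimal sketch, where X and k are placeholders rather than anything from the question:

[coefs, scores, ~, ~, ~, mu] = pca(X);              % mu: per-feature mean removed by pca
k    = 50;                                          % illustrative choice
Xlow = scores(:, 1:k);                              % reduced, more manageable representation
Xhat = bsxfun(@plus, Xlow * coefs(:, 1:k).', mu);   % approximate reconstruction of X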

Now for your problem: you are trying to explain the variance across your 400 observations using 153,600 features. However, you don't need that much information; 399 PCs will explain 100% of the variance across your sample (I will be very surprised if that is not the case). The reason is basically overfitting: your algorithm finds noise that explains every observation in your sample.
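You can check that claim directly with pca's explained output (using the question's train_data; the exact numbers depend on your data):

[~, ~, ~, ~, explained] = pca(train_data);   % per-component % of variance explained
cumExplained = cumsum(explained);
fprintf('Components: %d, cumulative variance explained: %.2f%%\n', ...
        numel(explained), cumExplained(end))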

So what rayryeng was telling you is correct: if you want to reduce your feature space to 10,000 PCs, you will need around 100,000 observations for the PCs to mean anything (that is a rule of thumb, but a rather stable one).

And the reason MATLAB gave you 399 PCs is that it was able to correctly extract 399 linear combinations that explained some percentage of the variance across your sample.

If, on the other hand, what you are after are the most relevant features, then you are not looking for dimensionality-reduction flows but rather feature-elimination processes. These keep only the most relevant features while nulling out the irrelevant ones.
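As one simple (and admittedly crude) example of feature elimination, you could rank the 153,600 features by their variance and keep only the top k; the value of k below is arbitrary and just for illustration:

k = 1000;                                  % illustrative number of features to keep
featVar  = var(train_data, 0, 1);          % variance of each feature across observations
[~, idx] = sort(featVar, 'descend');
selected = train_data(:, idx(1:k));        % keep the k most variable features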

Just to make clear: if your feature space is rubbish and there isn't any information in it, just noise, then the variance explained will be irrelevant and will indeed be less than 100%. For example, see the following:

data = rand(400,401);                    % noise only: 400 observations, 401 features
[coefs,scores,variances] = pca(data);
numel(variances)                         % at most 399 components come back
disp(['Var explained: ' num2str(100*cumsum(variances).'/sum(variances)) ' %'])

Again, if you want to reduce your feature space there are ways to do that even with a small number of observations m, but PCA is not one of them.

Good Luck

eyalsoreq answered Sep 28 '22