Selecting the components showing the most variance in PCA

I have a huge data set (32000*2500) that I need for training. This seems to be too much for my classifier, so I decided to do some reading on dimensionality reduction and specifically into PCA.

From my understanding, PCA takes the current data and replots them on a new set of axes. The new coordinates don't have a direct meaning, but the data is rearranged so that the first axis has the maximum variation. Once I have these new coefficients, I can drop the coeffs having minimum variation.

Now I am trying to implement this in MATLAB and am having trouble with the output provided. MATLAB always considers rows as observations and columns as variables. So my input to the pca function would be my matrix of size (32000*2500). This would return the PCA coefficients in an output matrix of size 2500*2500.

The help for pca states:

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.

In this output, which dimension holds the observations of my data? I mean, if I have to give this to the classifier, will the rows of coeff represent my data's observations, or is it now the columns of coeff?

And how do I remove the coefficients having the least variation?

StuckInPhDNoMore asked Feb 27 '16 15:02


1 Answer

(Disclaimer: it's been a long time since I switched from MATLAB to SciPy, but the principles are the same.)

If you use the svd function on X after subtracting each column's mean (PCA assumes centered data)

[U,S,V] = svd(X)

then to reduce the dimension of X to k, you'd multiply by the first k columns of V. In MATLAB, I believe that's

X * V(:, 1:k);
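To make the steps concrete, here is a minimal sketch of the SVD route, with the centering step included. The small random matrix is only a stand-in for the real 32000*2500 data, and the names Xc, Z, compVar and the choice k = 3 are mine, purely for illustration. I use the economy-size SVD because the full one would try to build a 32000*32000 U.

```matlab
% Small synthetic stand-in for the real 32000-by-2500 matrix (assumption).
X = randn(200, 10) * randn(10, 10);

k = 3;                            % number of components to keep (illustrative)

mu = mean(X, 1);                  % 1-by-p row vector of column means
Xc = bsxfun(@minus, X, mu);       % center every variable (column)

% Economy-size SVD: U is n-by-p rather than n-by-n.
[U, S, V] = svd(Xc, 'econ');

% Project onto the first k right singular vectors.
% Z is n-by-k: rows are still the observations, columns are the new variables.
Z = Xc * V(:, 1:k);

% Variance captured by each component, in descending order.
compVar = diag(S).^2 / (size(X, 1) - 1);
```

The rows of Z line up one-to-one with the rows of X, so Z is what you would hand to the classifier. The columns of V are the directions, not the data.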

Refer to The Elements of Statistical Learning for the theory.
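If you would rather stay with the pca function from the question, a rough sketch follows. It assumes the Statistics and Machine Learning Toolbox and its documented outputs coeff, score, latent and explained. The 95% threshold, the stand-in data and the variable names Xreduced, Xnew and XnewReduced are illustrative choices of mine, not something from the question.

```matlab
% Small synthetic stand-in for the real 32000-by-2500 matrix (assumption).
X = randn(200, 10) * randn(10, 10);

% coeff     : p-by-p, each COLUMN is one principal direction (not an observation)
% score     : n-by-p, each ROW is one observation in the new coordinates
% latent    : p-by-1, variance of each component, descending
% explained : p-by-1, percentage of total variance per component
[coeff, score, latent, ~, explained] = pca(X);

% Keep the smallest number of components covering 95% of the variance
% (the threshold is an arbitrary example value).
k = find(cumsum(explained) >= 95, 1);

% Reduced training data: drop the low-variance columns of score.
Xreduced = score(:, 1:k);

% New data must be centered with the TRAINING means and projected with the
% same coeff. Here the first five rows of X stand in for unseen data.
mu = mean(X, 1);
Xnew = X(1:5, :);
XnewReduced = bsxfun(@minus, Xnew, mu) * coeff(:, 1:k);
```

So for the classifier you use score (rows are observations), and "removing the coefficients with the least variation" just means keeping the first k columns of score, or equivalently the first k columns of coeff when projecting.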

Ami Tavory answered Sep 20 '22 19:09