I have a huge data set (32000*2500) that I need for training. This seems to be too much for my classifier, so I decided to do some reading on dimensionality reduction, specifically PCA.
From my understanding, PCA takes the data and projects it onto a new set of axes. These new coordinates don't mean anything by themselves, but the data is rearranged so that the first axis carries the maximum variation. From these new coefficients I can then drop the ones with minimum variation.
Now I am trying to implement this in MATLAB and am having trouble interpreting the output. MATLAB always treats rows as observations and columns as variables, so my input to the pca function would be my matrix of size 32000*2500. This returns the PCA coefficients in an output matrix of size 2500*2500.
The help for pca states:
Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.
In this output, which dimension holds the observations of my data? I mean, if I have to give this to the classifier, will the rows of coeff represent my data's observations, or is it now the columns of coeff?
And how do I remove the coefficients having the least variation?
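For concreteness, this is roughly the call I am making (data here is just a placeholder name for my 32000*2500 training matrix):

coeff = pca(data);   % data is 32000*2500, so coeff comes back as 2500*2500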
(Disclaimer: it's been a long time since I switched from MATLAB to SciPy, but the principles are the same.)
If you use the svd function

[U,S,V] = svd(X)

then to reduce the dimension of X to k, you'd multiply by the first k columns of V. In MATLAB, I'm guessing that's

X * V(:, 1:k);
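Two caveats from memory, so double-check against the docs: PCA operates on mean-centered data, so subtract the column means before the SVD, and with a 32000*2500 matrix you'll want the economy-size SVD so you don't materialize a 32000*32000 U. A rough sketch of the whole recipe, assuming X is your 32000*2500 matrix with rows as observations and k is a component count you pick:

% Sketch: PCA via SVD, rows of X are observations.
Xc = X - mean(X, 1);          % center each column; on pre-R2016b MATLAB use bsxfun(@minus, X, mean(X, 1))
[U, S, V] = svd(Xc, 'econ');  % economy SVD: V is 2500-by-2500, avoids a huge U
k = 100;                      % number of components to keep (your choice)
Xreduced = Xc * V(:, 1:k);    % 32000-by-k: one row per observation

Xreduced is the same thing (up to sign flips of individual columns) that MATLAB's pca returns as its second output, score(:, 1:k). So to answer your question: the rows of the reduced matrix are still your observations, and you drop the low-variance components simply by keeping only the first k columns.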
Refer to Elements of Statistical Learning for the theory.