I have about 1000 vectors x_i of dimension 50000, but they are very sparse; each has only about 50-100 nonzero elements. I want to do PCA on this dataset (in MATLAB) to reduce the unneeded extreme dimensionality of the data.
Unfortunately, I don't know any way to do this without an intermediate full matrix due to the need to subtract the means from all examples. And of course, a 1000x50000 matrix is too big to fit into memory (it actually crashes my entire computer for some reason when I try). Matlab's built in princomp
crashes my computer when I try to use it, too.
So my question is: is there a way to do PCA on this data without requiring a massive non-sparse matrix as an intermediate step?
You don't need to form the full data matrix to subtract the means, OR to compute the covariance matrix. Just compute the 1000x1000 covariance matrix iteratively (loop through the data vectors). Once you have formed the covariance matrix, you can implicitly subtract the means by centering the covariance matrix. See the section at the end of this paper on kernel PCA explaining how to center a kernel matrix. Just consider the kernel matrix basically the same as the covariance matrix.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With