
Principal Component Analysis (PCA) on huge sparse dataset

I have about 1000 vectors x_i of dimension 50000, but they are very sparse; each has only about 50-100 nonzero elements. I want to do PCA on this dataset (in MATLAB) to reduce its needlessly extreme dimensionality.

Unfortunately, I don't know of any way to do this without first forming a full intermediate matrix, since I need to subtract the mean from every example. And of course a dense 1000x50000 matrix is too big to fit into memory (for some reason it actually crashes my entire computer when I try). MATLAB's built-in princomp crashes my computer when I try to use it, too.

So my question is: is there a way to do PCA on this data without requiring a massive non-sparse matrix as an intermediate step?

Sean asked Nov 16 '12


1 Answer

You don't need to form the full data matrix either to subtract the means or to compute the covariance matrix. Since you have far fewer examples (1000) than dimensions (50000), work with the 1000x1000 matrix of inner products between examples (the Gram matrix) instead: build it iteratively by looping over pairs of data vectors, which never densifies the data. Once you have formed this matrix, you can subtract the means implicitly by centering it. See the section at the end of this paper on kernel PCA explaining how to center a kernel matrix; with a linear kernel, the kernel matrix is exactly this Gram matrix, so the same centering applies.
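The two steps above (building the Gram matrix from sparse vectors, then double-centering it) can be sketched as follows. This is a pure-Python illustration under the assumption that each sparse vector is stored as an {index: value} dict; in MATLAB you would use sparse() storage and then eig or eigs on the centered matrix instead.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {index: value} dicts."""
    if len(b) < len(a):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b[i] for i, v in a.items() if i in b)

def gram_matrix(vectors):
    """n x n matrix of pairwise inner products, built by looping over the
    data vectors -- the 1000x50000 matrix is never formed."""
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = sparse_dot(vectors[i], vectors[j])
    return K

def center_gram(K):
    """Double-center K: Kc[i][j] = K[i][j] - rowmean[i] - rowmean[j] + mean.
    This is the kernel-PCA centering step, and it is equivalent to having
    subtracted the mean vector from every example before taking products."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]
```

From there, eigendecompose the centered Gram matrix: its top eigenvectors, scaled by the inverse square roots of their eigenvalues, give the projections of the examples onto the principal components without ever touching a 50000x50000 covariance matrix.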

flubdub answered Sep 17 '22