
Principal Component Analysis (PCA) on huge sparse dataset

I have about 1000 vectors x_i of dimension 50000, but they are very sparse; each has only about 50-100 nonzero elements. I want to do PCA on this dataset (in MATLAB) to reduce its needlessly extreme dimensionality.

Unfortunately, I don't know of any way to do this without first forming a full intermediate matrix, since I need to subtract the mean from every example. And of course a dense 1000x50000 matrix is too big to fit into memory (for some reason it actually crashes my entire computer when I try). MATLAB's built-in princomp crashes my computer when I try to use it, too.

So my question is: is there a way to do PCA on this data without requiring a massive non-sparse matrix as an intermediate step?

Sean asked Nov 16 '12


1 Answer

You don't need to form the full data matrix either to subtract the means or to compute the covariance matrix. Since you have far fewer examples (1000) than dimensions (50000), work with the 1000x1000 matrix of inner products between examples (the Gram matrix) instead: build it iteratively by looping over pairs of data vectors, which never densifies the data. Once you have formed this matrix, you can subtract the means implicitly by centering it. See the section at the end of this paper on kernel PCA explaining how to center a kernel matrix; with a linear kernel, the kernel matrix is exactly this Gram matrix, so the same centering applies.
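The two steps above (building the Gram matrix from sparse vectors, then double-centering it) can be sketched as follows. This is a pure-Python illustration under the assumption that each sparse vector is stored as an {index: value} dict; in MATLAB you would use sparse() storage and then eig or eigs on the centered matrix instead.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {index: value} dicts."""
    if len(b) < len(a):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b[i] for i, v in a.items() if i in b)

def gram_matrix(vectors):
    """n x n matrix of pairwise inner products, built by looping over the
    data vectors -- the 1000x50000 matrix is never formed."""
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = sparse_dot(vectors[i], vectors[j])
    return K

def center_gram(K):
    """Double-center K: Kc[i][j] = K[i][j] - rowmean[i] - rowmean[j] + mean.
    This is the kernel-PCA centering step, and it is equivalent to having
    subtracted the mean vector from every example before taking products."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]
```

From there, eigendecompose the centered Gram matrix: its top eigenvectors, scaled by the inverse square roots of their eigenvalues, give the projections of the examples onto the principal components without ever touching a 50000x50000 covariance matrix.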

flubdub answered Sep 17 '22