
Apply PCA on very large sparse matrix

I am doing a text classification task in R and have a document-term matrix of size 22,490 by 120,000 with only 4 million non-zero entries (less than 1% of all entries). I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle a matrix this large, so I have stored the sparse matrix in a file in the Matrix Market format, hoping to run PCA with some other tool.

Could anyone suggest libraries (in any programming language) that can easily run PCA on a matrix of this scale? Alternatively, how could I do PCA longhand, that is, first calculate the covariance matrix and then compute its eigenvalues and eigenvectors?

What I want is to calculate all 120,000 PCs and keep only the top N that together account for 90% of the variance. Obviously, in this case I have to set a threshold a priori and zero out very small entries in the covariance matrix; otherwise the covariance matrix will not be sparse, and at 120,000 by 120,000 it is impossible to handle on a single machine. The loadings (eigenvectors) will also be extremely large and should be stored in a sparse format.

Thanks very much for any help!

Note: I am using a machine with 24GB RAM and 8 cpu cores.
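For a sense of the sizes involved, here is a back-of-the-envelope sketch. It assumes 8-byte floats for dense storage, and CSR storage with 8-byte values and 4-byte indices for the sparse case. The helper names are made up for illustration.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold an n_rows x n_cols dense matrix."""
    return n_rows * n_cols * bytes_per_value


def csr_bytes(n_rows, nnz, bytes_per_value=8, bytes_per_index=4):
    """Approximate memory for a CSR matrix: values, column indices, row pointers."""
    return nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index


# Figures taken from the question.
n_docs, n_terms, nnz = 22_490, 120_000, 4_000_000

density = nnz / (n_docs * n_terms)                # fraction of non-zero entries
cov_gb = dense_bytes(n_terms, n_terms) / 1e9      # dense 120,000 x 120,000 covariance
sparse_mb = csr_bytes(n_docs, nnz) / 1e6          # the document-term matrix in CSR

print(f"density of the document-term matrix: {density:.4%}")
print(f"dense covariance matrix: {cov_gb:.1f} GB")
print(f"sparse document-term matrix: {sparse_mb:.1f} MB")
```

Under these assumptions the dense covariance matrix alone needs about 115 GB, far beyond 24 GB of RAM, while the sparse document-term matrix itself fits in roughly 48 MB.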

Ensom Hodder asked May 23 '12 10:05




1 Answer

The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).

Disclaimer: I'm on the scikit-learn development team.

EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
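A minimal sketch of that route, assuming a recent scikit-learn and SciPy. The function names, the `n_components` value, and the file path are illustrative and not from the question. Note that `TruncatedSVD` does not center the data, so the result is a truncated SVD (as used in latent semantic analysis) rather than an exact PCA; centering would destroy the sparsity.

```python
import numpy as np
from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD


def reduce_matrix_market(path, n_components=100, random_state=0):
    """Load a Matrix Market file and project its rows onto n_components dimensions.

    `path` and the default n_components are illustrative assumptions.
    """
    # mmread returns a COO matrix for sparse files; CSR is better for arithmetic.
    X = mmread(path).tocsr()
    svd = TruncatedSVD(
        n_components=n_components,
        algorithm="randomized",
        random_state=random_state,
    )
    X_reduced = svd.fit_transform(X)  # dense array, shape (n_rows, n_components)
    return X_reduced, svd


def components_for_variance(svd, threshold=0.9):
    """Smallest number of fitted components whose cumulative explained
    variance ratio reaches `threshold`, or None if they never reach it."""
    cumulative = np.cumsum(svd.explained_variance_ratio_)
    if cumulative[-1] < threshold:
        return None  # refit with a larger n_components
    return int(np.searchsorted(cumulative, threshold) + 1)
```

Only the requested components are computed, so the full set of 120,000 is never formed. If `components_for_variance` returns `None`, one option is to refit with a larger `n_components` until the threshold is reached. The loadings are available as `svd.components_`, a dense array of shape `(n_components, n_features)`.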

Fred Foo answered Oct 26 '22 10:10