Apply PCA on very large sparse matrix

I am working on a text classification task in R and have a document-term matrix of size 22,490 by 120,000 (only 4 million non-zero entries, less than 1% of all entries). I want to reduce its dimensionality with PCA (Principal Component Analysis). Unfortunately, R cannot handle a matrix this large, so I have stored the sparse matrix in a file in Matrix Market format, hoping to run PCA with some other tool.

Could anyone suggest libraries (in any programming language) that can do PCA on a matrix of this scale with ease? Alternatively, how could I do a longhand PCA myself, that is, first calculate the covariance matrix and then calculate its eigenvalues and eigenvectors?

What I want is to calculate all PCs (120,000) and keep only the top N PCs that account for 90% of the variance. Obviously, in this case I have to choose a threshold a priori and set very small values in the covariance matrix to 0; otherwise the covariance matrix will not be sparse, and at 120,000 by 120,000 it would be impossible to handle on a single machine. The loadings (eigenvectors) will also be extremely large and should be stored in a sparse format.
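To make the longhand procedure concrete, here is a minimal sketch on a toy matrix: covariance matrix, eigendecomposition, then the smallest N whose cumulative variance reaches 90%. The function name `longhand_pca`, the toy dimensions and the random data are assumptions for illustration only. The covariance matrix is built densely here, which is exactly the step that does not scale to 120,000 features.

```python
import numpy as np
from scipy import sparse


def longhand_pca(X, variance_target=0.90):
    """PCA via the covariance matrix (hypothetical helper).

    Only feasible when n_features is small enough for a dense
    n_features x n_features matrix to fit in memory.
    """
    n_samples = X.shape[0]
    mean = np.asarray(X.mean(axis=0)).ravel()

    # X^T X can be computed sparsely; subtracting the mean term
    # is what makes the covariance matrix dense.
    gram = (X.T @ X).toarray()
    cov = (gram - n_samples * np.outer(mean, mean)) / (n_samples - 1)

    # eigh returns eigenvalues in ascending order, so reverse them.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # drop tiny negative round-off
    eigvecs = eigvecs[:, order]

    # Smallest N whose cumulative variance share reaches the target.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_keep = min(int(np.searchsorted(ratio, variance_target)) + 1, len(eigvals))
    return eigvals[:n_keep], eigvecs[:, :n_keep], ratio[n_keep - 1]


# Toy stand-in for the document-term matrix (not the real data).
X = sparse.random(200, 30, density=0.05, format="csr", random_state=0)
eigvals, components, reached = longhand_pca(X)
print(len(eigvals), "components reach", round(float(reached), 3), "of the variance")
```

The columns of `components` are the loadings; at full scale they would be dense vectors of length 120,000 each, which is the storage concern raised above.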

Thanks very much for any help!

Note: I am using a machine with 24 GB of RAM and 8 CPU cores.

asked May 23 '12 10:05

Ensom Hodder

1 Answer

The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I have never tried it, though).

Disclaimer: I'm on the scikit-learn development team.

EDIT: sparse matrix support in RandomizedPCA was deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
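A rough sketch of that route, assuming a reasonably recent scikit-learn and SciPy: read the Matrix Market file with scipy.io.mmread, convert it to CSR, and fit TruncatedSVD. The file name `dtm.mtx`, the toy matrix size and `n_components=20` are placeholders, not values from the question. Note that TruncatedSVD does not center the data, so the result is a truncated SVD (as in LSA) rather than PCA in the strict sense.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the real document-term matrix (assumption: random data).
toy = sparse.random(300, 1000, density=0.01, format="coo", random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dtm.mtx")  # hypothetical file name
    mmwrite(path, toy)                   # write in Matrix Market format
    X = mmread(path).tocsr()             # mmread yields COO; CSR suits row access

# Keep the matrix sparse throughout; only the reduced output is dense.
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)

# Cumulative share of variance captured by the leading components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
print(X_reduced.shape)
print(cumulative[-1])
```

To aim for a variance target such as 90%, one option is to increase `n_components` until the last entry of `cumulative` passes the threshold, rather than computing every component up front.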

answered Oct 26 '22 10:10

Fred Foo


Tags:

language-agnostic

machine-learning

sparse-matrix

pca
