R internal handling of sparse matrices

I have been comparing the performance of several PCA implementations from both Python and R, and noticed an interesting behavior:
It seems impossible to compute the PCA of a sparse matrix in Python: the only candidate would be scikit-learn's TruncatedSVD, yet it does not support the mean-centering required to make it equivalent to a covariance-based PCA. The developers' argument is that centering would destroy the sparsity of the matrix. Other implementations, like Facebook's PCA algorithm or the PCA/randomized PCA methods in scikit-learn, do not support sparse matrices for similar reasons.

While all of that makes sense to me, several R packages, like irlba, rsvd, etc., are able to handle sparse matrices (e.g. generated with rsparsematrix), and even allow for an explicit center = TRUE argument.

My question is how R handles this internally, as it seems to be vastly more efficient than the comparable Python implementations. Does R maintain sparsity by doing absolute scaling instead (which would theoretically falsify the results, but at least maintain sparsity)? Or is there some way in which the mean can be stored explicitly for the zero values, and stored only once (instead of once for every value)?

To put the question concisely: how does R internally store mean-centered matrices without exploding RAM usage?
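For reference, the R behavior I mean can be reproduced with a minimal sketch like the following (assuming the Matrix and irlba packages; illustrative, not a benchmark):

library(Matrix)  # provides rsparsematrix() and the dgCMatrix class
library(irlba)   # truncated SVD / PCA for large (sparse) matrices

set.seed(1)
A <- rsparsematrix(10000, 1000, density = 0.01)  # A stays a sparse dgCMatrix

# PCA with mean-centering; A is never densified
p <- prcomp_irlba(A, n = 10, center = TRUE, scale. = FALSE)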

asked Jun 14 '18 by dennlinger



1 Answer

The key here is that the underlying implementation of the partial SVD (irlba's restarted Lanczos bidiagonalization C code) never stores a centered copy of the matrix. Instead, it only records the result of applying the matrix, as a linear operator, to a small set of vectors obtained from the previous iteration.

Rather than explaining the concrete method used in the C code, which is quite advanced (see the paper for a description), I will explain it with a much simpler algorithm that captures the key idea of how sparsity keeps things efficient: the power method (or, for its generalization to multiple eigenvalues, the subspace iteration method). The algorithm returns the largest eigenvalue of a matrix A by iteratively applying the linear operator and then normalizing (or, in the case of subspace iteration, orthogonalizing a small set of vectors).

What you do at every iteration is

v = A*v
v = v/norm(v)
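As a concrete illustration, here is a minimal R sketch of this iteration for a square matrix (the function name is mine, not irlba's; the real Lanczos code is far more sophisticated):

# Plain power iteration for the dominant eigenvector of a square matrix A.
# Works unchanged for a sparse A, since only the product A %*% v is needed.
power_method <- function(A, iters = 100) {
  v <- rnorm(ncol(A))
  for (i in seq_len(iters)) {
    v <- as.numeric(A %*% v)   # sparse matrix-vector product
    v <- v / sqrt(sum(v^2))    # normalize
  }
  v
}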

The matrix multiplication step is the crucial one, so let's see what happens when we try the same thing with a centered A. With center denoting the vector of column means and ones denoting the vector of all ones, the formula for the centered matrix is:

A_center = A - ones * transpose(center)

So if we apply the iterative algorithm to this new matrix, we get:

v = A*v - dotproduct(center, v) * ones

Since A is sparse, we can use the sparse matrix-vector product for A*v, and the term dotproduct(center, v) * ones just entails subtracting the scalar dotproduct(center, v) from every entry of the resulting vector, which is linear in the dimension of A. The centered matrix is never formed explicitly, so sparsity is preserved.
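Putting it together, here is a sketch of the same iteration with implicit centering (again my own illustrative code, not irlba's actual implementation):

library(Matrix)

# Power iteration on the implicitly centered matrix A - ones %*% t(center).
# The dense centered matrix is never formed; A stays sparse throughout.
power_method_centered <- function(A, iters = 100) {
  center <- colMeans(A)           # column means; works directly on sparse A
  v <- rnorm(ncol(A))
  for (i in seq_len(iters)) {
    # (A - ones %*% t(center)) %*% v  ==  A %*% v - sum(center * v) * ones
    v <- as.numeric(A %*% v) - sum(center * v)
    v <- v / sqrt(sum(v^2))
  }
  v
}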

answered Sep 30 '22 by Juan Carlos Ramirez