I have been comparing the performance of several PCA implementations from both Python and R, and noticed an interesting behavior:
While it seems impossible to compute the PCA of a sparse matrix in Python (the only approach would be scikit-learn's TruncatedSVD, yet it does not support the mean-centering required to be equivalent to a covariance solution for PCA. Their argumentation is, that it would destroy the sparsity property of the matrix. Other implementations like Facebook's PCA algorithm or the PCA/randomPCA method in scikit learn do not support sparse matrices for similar reasons.
While all of that makes sense to me, several R packages, like irlba, rsvd, etc., are able to handle sparse matrices (e.g. generated with rsparsematrix
), and even allow for specific center=True
arguments.
My question is, how R handles this internally, as it seems to be vastly more efficient than the comparable Python implementation. Does R still maintain the sparsity by doing Absolute Scaling instead (which would theoretically falsify the results, but at least maintain sparsity)? Or is there any way in which the mean can be stored explicitly for the zero values, and is only stored once (instead of for every value separately)?
To get put off hold: How does R internally store matrices with mean-centering without exploding RAM usage. Hope that is concise enough....
The solution to representing and working with sparse matrices is to use an alternate data structure to represent the sparse data. The zero values can be ignored and only the data or non-zero values in the sparse matrix need to be stored or acted upon.
The dgCMatrix class is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the “standard” class for sparse numeric matrices in the Matrix package.
Thus, a sparse matrix is a matrix in which the number of zeros is more than the number of non-zero elements. If we store this sparse matrix as it is, it will consume a lot of space. Therefore, we store only non-zero values in the memory in a more efficient way.
A sparse matrix can be stored in full-matrix storage mode or a packed storage mode. When a sparse matrix is stored in full-matrix storage mode, all its elements, including its zero elements, are stored in an array.
The key here is that the underlying implementation for the partial SVD (restarted Lanczos bidiagonalization C code) doesn't store the matrix. You instead record the result of the linear operation from the matrix applied to a small set of vectors obtained from the previous iteration.
Rather than explaining the concrete method used in the c code, which is quite advanced (see paper for description),I will explain it with a much simpler algorithm that captures the key idea in terms of how to preserve the efficiency from sparsity: the power method (or the subspace iteration method for its generalization to multiple eigenvalues). The algorithm returns the largest eigenvalue of a matrix A by iteratively applying a linear operator, then normalizing (or orthogonalizing a small set of vectors, in the case of subspace iteration)
What you do at every iteration is
v=A*v v=v/norm(v)
The matrix multiplication step is the crucial one, so let's see what happens when we try the same thing with a centered A. The matrix formula for centered A (with center
as the vector with the mean column values and ones
as the vector of ones) is:
A_center=A-ones*transpose(center)
So if we apply the iterative algorithm to this new matrix we would get
v=A*v-dotproduct(center,v)*ones
Since A was sparse we can use the sparse matrix-vector product on (A,v) and -dotproduct(center,v)*ones
just entails subtracting the dot product of center and v from the resulting vector which is linear on the dimension of A
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With