I want to generate a large sparse matrix and sum it, but I run into MemoryError a lot. So I tried the operation via scipy.sparse.csc_matrix.sum instead, but found that the data type changed back to a numpy matrix after taking the sum.
import numpy as np
from scipy import sparse

window = 10
np.random.seed(0)
mat = sparse.csc_matrix(np.random.rand(100, 120)>0.5, dtype='d')
print(type(mat))
>>> <class 'scipy.sparse.csc.csc_matrix'>
mat_head = mat[:,0:window].sum(axis=1)
print(type(mat_head))
>>> <class 'numpy.matrixlib.defmatrix.matrix'>
So I generated mat as an all-zeros matrix, just to test the result when mat_head is all zeros.
mat = sparse.csc_matrix((100,120))
print(type(mat))
>>> <class 'scipy.sparse.csc.csc_matrix'>
mat_head = mat.sum(axis=1)
print(type(mat_head))
>>> <class 'numpy.matrixlib.defmatrix.matrix'>
print(np.count_nonzero(mat_head))
>>> 0
Why does this happen? Does summing via scipy.sparse save no memory compared to numpy, since the result is converted back to a dense type anyway?
SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. A sparse representation stores only the non-zero values, together with the row and column positions where they occur.
The function csr_matrix() is used to create a sparse matrix of compressed sparse row format whereas csc_matrix() is used to create a sparse matrix of compressed sparse column format.
A dense matrix stored in a NumPy array can be converted into a sparse matrix in CSR format by calling the csr_matrix() function.
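As a small illustration of the conversion described above, here is a minimal sketch that turns a dense NumPy array into both CSR and CSC form. The array values are made up for the example; data, indices and indptr are the standard attributes of SciPy's compressed formats.

```python
import numpy as np
from scipy import sparse

# A small dense array that is mostly zeros (values chosen only for illustration).
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 5]], dtype=float)

csr = sparse.csr_matrix(dense)  # compressed sparse row
csc = sparse.csc_matrix(dense)  # compressed sparse column

# Both formats store only the non-zero values plus two index arrays.
print(csr.data, csr.indices, csr.indptr)
print(csc.data, csc.indices, csc.indptr)

# Converting back to dense recovers the original array.
roundtrip_ok = (np.array_equal(csr.toarray(), dense)
                and np.array_equal(csc.toarray(), dense))
print(roundtrip_ok)
```

The two formats hold the same three values; they differ only in whether indptr delineates rows (CSR) or columns (CSC).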
As far as it is possible to give a hard reason for what is essentially a design choice, I'd make the following argument:
The csr and csc formats are designed for sparse but not extremely sparse matrices. In particular, for an n x n matrix that has significantly fewer than n nonzeros, these formats are rather wasteful because, on top of the data and indices, they carry a field indptr (delineating rows or columns) of size n+1.
Therefore, assuming a properly utilized csc or csr matrix, it is reasonable to expect row or column sums not to be sparse, and the corresponding method should return a dense vector.
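To make the indptr overhead concrete, here is a rough sketch with a hypothetical n x n matrix holding only three entries. The size and the entries are arbitrary choices for illustration; the COO comparison at the end is my own addition and is only there for contrast.

```python
import numpy as np
from scipy import sparse

n = 100000  # hypothetical size, chosen only for illustration

# An n x n matrix with just three stored entries.
rows = np.array([0, 10, 99999])
cols = np.array([5, 5, 42])
vals = np.array([1.0, 2.0, 3.0])
mat = sparse.csc_matrix((vals, (rows, cols)), shape=(n, n))

# data and indices scale with the number of nonzeros ...
print(mat.data.size, mat.indices.size)
# ... but indptr always has n + 1 entries, one boundary per column.
print(mat.indptr.size)

# For contrast: the COO format stores one (row, col, value) triple per entry.
coo = mat.tocoo()
print(coo.row.size, coo.col.size, coo.data.size)
```

With far fewer than n nonzeros, almost all of the storage goes into indptr, which is the waste the argument above refers to.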
I'm aware that your question of "why" mostly targets the motivation behind the design decision, but anyway I tracked down how the result of csc_matrix.sum(axis=1) actually becomes a numpy matrix.
The csc_matrix class inherits from the _cs_matrix class, which inherits from the _data_matrix class, which inherits from the spmatrix base class. This last one implements .sum(ax) as
if axis == 0:
    # sum over columns
    ret = np.asmatrix(np.ones(
        (1, m), dtype=res_dtype)) * self
else:
    # sum over rows
    ret = self * np.asmatrix(np.ones((n, 1), dtype=res_dtype))
In other words, as also noted in a comment, the column/row sums are computed by multiplying with a dense row or column matrix of ones, respectively. The result of this operation will be a dense matrix which you see on output.
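The same idea can be reproduced by hand. The sketch below is not SciPy's internal code; it only mimics the multiply-by-ones approach with plain 1-D vectors and checks that it agrees with .sum(). The last part is an assumption of mine rather than something from the answer: multiplying by a sparse column of ones should keep the product sparse, which may help if the dense result is the memory problem.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
mat = sparse.csc_matrix(rng.rand(100, 120) > 0.5, dtype='d')
m, n = mat.shape

# Row sums (axis=1): sparse matrix times a dense vector of n ones.
row_sums = mat.dot(np.ones(n))        # dense 1-D array of length m
# Column sums (axis=0): the transpose times a dense vector of m ones.
col_sums = mat.T.dot(np.ones(m))      # dense 1-D array of length n

# Both agree with the built-in sum, which also returns a dense result.
rows_match = np.allclose(row_sums, np.asarray(mat.sum(axis=1)).ravel())
cols_match = np.allclose(col_sums, np.asarray(mat.sum(axis=0)).ravel())
print(rows_match, cols_match)

# Assumption, not from the original answer: multiplying by a *sparse*
# column of ones keeps the product in sparse form.
sparse_ones = sparse.csc_matrix(np.ones((n, 1)))
sparse_row_sums = mat.dot(sparse_ones)
print(sparse.issparse(sparse_row_sums), sparse_row_sums.shape)
```

Whether the sparse variant actually saves memory depends on how many rows have a zero sum; for a matrix like the random one above, nearly every row sum is nonzero and the sparse result is no smaller than the dense one.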
While some of the subclasses override their .sum() method, as far as I could tell this only happens for the axis=None case, so the result which you see can be attributed to the above block of code.