
Why does the result of scipy.sparse.csc_matrix.sum() change its type to numpy matrix?

I want to generate a large sparse matrix and sum it, but I keep running into MemoryError. So I tried the operation via scipy.sparse.csc_matrix.sum instead, but found that the type of the result changes back to a numpy matrix after taking the sum.

import numpy as np
from scipy import sparse

window = 10
np.random.seed(0)  # seed the RNG for reproducibility
mat = sparse.csc_matrix(np.random.rand(100, 120) > 0.5, dtype='d')
print(type(mat))
# <class 'scipy.sparse.csc.csc_matrix'>

mat_head = mat[:, 0:window].sum(axis=1)
print(type(mat_head))
# <class 'numpy.matrixlib.defmatrix.matrix'>

So I generated mat as an all-zeros matrix, just to test the result when mat_head is all zeros.

mat = sparse.csc_matrix((100, 120))
print(type(mat))
# <class 'scipy.sparse.csc.csc_matrix'>
mat_head = mat.sum(axis=1)
print(type(mat_head))
# <class 'numpy.matrixlib.defmatrix.matrix'>
print(np.count_nonzero(mat_head))
# 0

Why does this happen? Does this mean that summing via scipy.sparse saves no memory compared to numpy, since the result is converted back to a dense type anyway?

Jan asked Jun 06 '18 06:06



2 Answers

Insofar as it is possible to give a hard reason for what is essentially a design choice, I'd make the following argument:

The csr and csc formats are designed for sparse but not extremely sparse matrices. In particular, for an n x n matrix that has significantly fewer than n nonzeros, these formats are rather wasteful because, on top of the data and indices, they carry a field indptr (delineating rows or columns) of size n+1.

Therefore, assuming a properly utilized csc or csr matrix, it is reasonable to expect row or column sums not to be sparse, so the corresponding method should return a dense vector.
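
The indptr overhead mentioned above can be checked directly. The following is only a minimal sketch: the size n and the three nonzero positions are made up for illustration and do not come from the question.

```python
import numpy as np
from scipy import sparse

n = 10_000
# Hypothetical example: an n x n matrix with only 3 stored nonzeros.
rows = np.array([0, 5, 42])
cols = np.array([1, 7, 9_999])
vals = np.array([1.0, 2.0, 3.0])
mat = sparse.csc_matrix((vals, (rows, cols)), shape=(n, n))

# data and indices grow with the number of stored nonzeros ...
print(mat.data.size, mat.indices.size)   # 3 3
# ... but indptr always has one entry per column, plus one.
print(mat.indptr.size)                   # 10001

# The dense row sums hold n values, the same order of size as indptr.
row_sums = mat.sum(axis=1)
print(row_sums.shape)                    # (10000, 1)
```

So for a matrix this sparse, the dense vector of sums costs about as much as the indptr array the format already carries.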

Paul Panzer answered Nov 15 '22 05:11


I'm aware that your question of "why" mostly targets the motivation behind the design decision, but anyway I tracked down how the result of csc_matrix.sum(axis=1) actually becomes a numpy matrix.

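The csc_matrix class inherits from the _cs_matrix class, which inherits from the _data_matrix class, which inherits from the spmatrix base class. This last one implements .sum(axis) as follows, where m and n are the matrix's row and column counts and res_dtype is the dtype used for the result: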

if axis == 0:
    # sum over columns
    ret = np.asmatrix(np.ones(
        (1, m), dtype=res_dtype)) * self
else:
    # sum over rows
    ret = self * np.asmatrix(np.ones((n, 1), dtype=res_dtype))

In other words, as also noted in a comment, the column/row sums are computed by multiplying with a dense row or column matrix of ones, respectively. The result of this operation is a dense matrix, which is what you see in the output.
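
The multiplication by ones described above can be reproduced by hand. This is a small sketch rather than the library's actual code path: the 2 x 3 matrix and the names ones_col and ones_row are chosen here purely for illustration.

```python
import numpy as np
from scipy import sparse

# Small illustrative matrix (values chosen arbitrarily for this sketch).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0]])
mat = sparse.csc_matrix(dense)
m, n = mat.shape

# Row sums: sparse (m x n) times a dense column of ones (n x 1).
ones_col = np.asmatrix(np.ones((n, 1)))
row_sums = mat * ones_col

# Column sums: dense row of ones (1 x m) times sparse (m x n).
ones_row = np.asmatrix(np.ones((1, m)))
col_sums = ones_row * mat

print(type(row_sums).__name__)   # matrix
print(row_sums.T)                # [[3. 3.]]
print(col_sums)                  # [[1. 0. 5.]]

# Both products agree with what .sum() reports.
assert np.array_equal(row_sums, mat.sum(axis=1))
assert np.array_equal(col_sums, mat.sum(axis=0))
```

Because one operand of each product is a dense numpy matrix, the product comes back as a dense numpy matrix as well, which matches the type printed in the question.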

While some of the subclasses override their .sum() method, as far as I could tell this only happens for the axis=None case, so the result which you see can be attributed to the above block of code.
