Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to slice a scipy.sparse matrix?

Tags:

I normally use

matrix[:, i:] 

It seems not work as fast as I expected.

like image 676
todpole3 Avatar asked Dec 12 '12 15:12

todpole3


People also ask

How do you slice a sparse matrix?

So we first convert the COO sparse matrix to CSR (Compressed Sparse Row format) matrix using tocsr() function. And then we can slice the sparse matrix rows using the row indices array we created. We can see that after slicing we get a sparse matrix of size 3×5 in CSR format.

How do you reduce sparse matrix in python?

The dimensionality of the sparse matrix can be reduced by first representing the dense matrix as a Compressed sparse row representation in which the sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.

What does Scipy sparse Csr_matrix do?

Returns a copy of column i of the matrix, as a (m x 1) CSR matrix (column vector). Format of a matrix representation as a string. Maximum number of elements to display when printed. Number of stored values, including explicit zeros.

Which data structure is the best in representing the sparse matrices?

In a linked list representation, the linked list data structure is used to represent the sparse matrix. The advantage of using a linked list to represent the sparse matrix is that the complexity of inserting or deleting a node in a linked list is lesser than the array.


2 Answers

If you want to obtain a sparse matrix as output the fastest way to do row slicing is to have a csr type, and for columns slicing csc, as detailed here. In both cases you just have to do what you are currently doing:

matrix[l1:l2,c1:c2] 

If you want another type as output there maybe faster ways. In this other answer it is explained many methods for slicing a matrix and their different timings compared. For example, if you want a ndarray as output the fastest slicing is:

matrix.A[l1:l2,c1:c2]  

or:

matrix.toarray()[l1:l2,c1:c2] 

much faster than:

matrix[l1:l2,c1:c2].A #or .toarray() 
like image 60
Saullo G. P. Castro Avatar answered Oct 02 '22 16:10

Saullo G. P. Castro


I've found that the advertised fast row indexing of scipy.sparse.csr_matrix can be made a lot quicker by rolling your own row indexer. Here's the idea:

class SparseRowIndexer:     def __init__(self, csr_matrix):         data = []         indices = []         indptr = []          # Iterating over the rows this way is significantly more efficient         # than csr_matrix[row_index,:] and csr_matrix.getrow(row_index)         for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]):              data.append(csr_matrix.data[row_start:row_end])              indices.append(csr_matrix.indices[row_start:row_end])              indptr.append(row_end-row_start) # nnz of the row          self.data = np.array(data)         self.indices = np.array(indices)         self.indptr = np.array(indptr)         self.n_columns = csr_matrix.shape[1]      def __getitem__(self, row_selector):         data = np.concatenate(self.data[row_selector])         indices = np.concatenate(self.indices[row_selector])         indptr = np.append(0, np.cumsum(self.indptr[row_selector]))          shape = [indptr.shape[0]-1, self.n_columns]          return sparse.csr_matrix((data, indices, indptr), shape=shape) 

That is, it is possible to utilize the fast indexing of numpy arrays by storing the non-zero values of each row in separate arrays (with a different length for each row) and putting all of those row arrays in an object-typed array (allowing each row to have a different size) that can be indexed efficiently. The column indices are stored the same way. The approach is slightly different to the standard CSR data structure which stores all non-zero values in a single array, requiring look-ups to see where each row starts and ends. These look-ups can slow down random access but should be efficient for retrieval of contiguous rows.

Profiling results

My matrix mat is a 1,900,000x1,250,000 csr_matrix with 400,000,000 non-zero elements. ilocs is an array of 200,000 random row indices.

>>> %timeit mat[ilocs] 2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

compared to:

>>> row_indexer = SparseRowIndexer(mat) >>> %timeit row_indexer[ilocs] 59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 

The SparseRowIndexer seems to be faster when using fancy indexing compared to boolean masks.

like image 31
Sorig Avatar answered Oct 02 '22 16:10

Sorig