I normally use
matrix[:, i:]
It seems not work as fast as I expected.
So we first convert the COO sparse matrix to CSR (Compressed Sparse Row format) matrix using tocsr() function. And then we can slice the sparse matrix rows using the row indices array we created. We can see that after slicing we get a sparse matrix of size 3×5 in CSR format.
The dimensionality of the sparse matrix can be reduced by first representing the dense matrix as a Compressed sparse row representation in which the sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.
Returns a copy of column i of the matrix, as a (m x 1) CSR matrix (column vector). Format of a matrix representation as a string. Maximum number of elements to display when printed. Number of stored values, including explicit zeros.
In a linked list representation, the linked list data structure is used to represent the sparse matrix. The advantage of using a linked list to represent the sparse matrix is that the complexity of inserting or deleting a node in a linked list is lesser than the array.
If you want to obtain a sparse matrix as output the fastest way to do row slicing is to have a csr
type, and for columns slicing csc
, as detailed here. In both cases you just have to do what you are currently doing:
matrix[l1:l2,c1:c2]
If you want another type as output there maybe faster ways. In this other answer it is explained many methods for slicing a matrix and their different timings compared. For example, if you want a ndarray
as output the fastest slicing is:
matrix.A[l1:l2,c1:c2]
or:
matrix.toarray()[l1:l2,c1:c2]
much faster than:
matrix[l1:l2,c1:c2].A #or .toarray()
I've found that the advertised fast row indexing of scipy.sparse.csr_matrix
can be made a lot quicker by rolling your own row indexer. Here's the idea:
class SparseRowIndexer: def __init__(self, csr_matrix): data = [] indices = [] indptr = [] # Iterating over the rows this way is significantly more efficient # than csr_matrix[row_index,:] and csr_matrix.getrow(row_index) for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]): data.append(csr_matrix.data[row_start:row_end]) indices.append(csr_matrix.indices[row_start:row_end]) indptr.append(row_end-row_start) # nnz of the row self.data = np.array(data) self.indices = np.array(indices) self.indptr = np.array(indptr) self.n_columns = csr_matrix.shape[1] def __getitem__(self, row_selector): data = np.concatenate(self.data[row_selector]) indices = np.concatenate(self.indices[row_selector]) indptr = np.append(0, np.cumsum(self.indptr[row_selector])) shape = [indptr.shape[0]-1, self.n_columns] return sparse.csr_matrix((data, indices, indptr), shape=shape)
That is, it is possible to utilize the fast indexing of numpy arrays by storing the non-zero values of each row in separate arrays (with a different length for each row) and putting all of those row arrays in an object-typed array (allowing each row to have a different size) that can be indexed efficiently. The column indices are stored the same way. The approach is slightly different to the standard CSR data structure which stores all non-zero values in a single array, requiring look-ups to see where each row starts and ends. These look-ups can slow down random access but should be efficient for retrieval of contiguous rows.
My matrix mat
is a 1,900,000x1,250,000 csr_matrix
with 400,000,000 non-zero elements. ilocs
is an array of 200,000 random row indices.
>>> %timeit mat[ilocs] 2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
compared to:
>>> row_indexer = SparseRowIndexer(mat) >>> %timeit row_indexer[ilocs] 59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The SparseRowIndexer seems to be faster when using fancy indexing compared to boolean masks.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With