What is the fastest way to slice a scipy.sparse matrix?

I normally use

matrix[:, i:]

It does not seem to be as fast as I expected.

asked Dec 12 '12 15:12

todpole3

2 Answers

If you want a sparse matrix as output, the fastest way to slice rows is to use a csr matrix, and the fastest way to slice columns is to use a csc matrix, as detailed here. In both cases you just do what you are already doing:

matrix[l1:l2,c1:c2]
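
As a minimal sketch of the advice above, the snippet below converts one matrix to both formats and slices each along its preferred axis. The matrix size, density, and slice bounds are made up for illustration and are not from the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical example matrix; shape and density are arbitrary.
matrix = sparse.random(1000, 800, density=0.01, format="coo", random_state=0)

rows_fast = matrix.tocsr()  # CSR stores the non-zeros row by row
cols_fast = matrix.tocsc()  # CSC stores the non-zeros column by column

row_block = rows_fast[100:200, :]  # row slicing on the CSR copy
col_block = cols_fast[:, 300:]     # column slicing on the CSC copy

print(row_block.shape, row_block.format)
print(col_block.shape, col_block.format)
```

The conversion itself costs time and memory, so it is probably only worth doing once up front when many slices along the same axis will follow.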

If you want another type as output, there may be faster ways. This other answer explains several methods for slicing a matrix and compares their timings. For example, if you want an ndarray as output, the fastest slicing is:

matrix.A[l1:l2,c1:c2]

or:

matrix.toarray()[l1:l2,c1:c2]

much faster than:

matrix[l1:l2,c1:c2].A #or .toarray()
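
To check the comparison on your own data, a small timing harness along these lines may help. It is only a sketch: the matrix size, density, and slice bounds are assumptions, and the ranking can change with matrix shape, sparsity, and SciPy version. Note that converting the whole matrix with toarray() allocates a dense array of the full shape, which may not fit in memory for large matrices.

```python
import timeit
import numpy as np
from scipy import sparse

# Hypothetical test matrix and slice bounds; adjust to resemble your data.
matrix = sparse.random(2000, 2000, density=0.01, format="csr", random_state=0)
l1, l2, c1, c2 = 100, 1100, 200, 1200

def dense_then_slice():
    # Convert everything to an ndarray first, then slice with NumPy.
    return matrix.toarray()[l1:l2, c1:c2]

def slice_then_dense():
    # Slice the sparse matrix first, then convert only the result.
    return matrix[l1:l2, c1:c2].toarray()

# Both routes must produce the same ndarray.
assert np.array_equal(dense_then_slice(), slice_then_dense())

repeats = 5
for fn in (dense_then_slice, slice_then_dense):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e3:.2f} ms per call")
```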

answered Oct 02 '22 16:10

Saullo G. P. Castro

I've found that the advertised fast row indexing of scipy.sparse.csr_matrix can be made a lot quicker by rolling your own row indexer. Here's the idea:

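import numpy as np
from scipy import sparse


class SparseRowIndexer:
    def __init__(self, csr_matrix):
        data = []
        indices = []
        indptr = []

        # Iterating over the rows this way is significantly more efficient
        # than csr_matrix[row_index,:] and csr_matrix.getrow(row_index)
        for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]):
            data.append(csr_matrix.data[row_start:row_end])
            indices.append(csr_matrix.indices[row_start:row_end])
            indptr.append(row_end - row_start)  # nnz of the row

        self.data = self._object_array(data)
        self.indices = self._object_array(indices)
        self.indptr = np.array(indptr)
        self.n_columns = csr_matrix.shape[1]

    @staticmethod
    def _object_array(rows):
        # The rows have different lengths, so keep them in a 1-D object array.
        # Filling it element by element stops NumPy from building a 2-D array
        # (equal-length rows) or rejecting the ragged input (recent versions).
        out = np.empty(len(rows), dtype=object)
        for i, row in enumerate(rows):
            out[i] = row
        return out

    def __getitem__(self, row_selector):
        # row_selector: a slice, an integer index array or a boolean mask
        data = np.concatenate(self.data[row_selector])
        indices = np.concatenate(self.indices[row_selector])
        indptr = np.append(0, np.cumsum(self.indptr[row_selector]))

        shape = (indptr.shape[0] - 1, self.n_columns)

        return sparse.csr_matrix((data, indices, indptr), shape=shape)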

That is, the fast indexing of numpy arrays can be exploited by storing the non-zero values of each row in a separate array (each row can have a different length) and putting all of those row arrays in an object-typed array that can be indexed efficiently. The column indices are stored the same way. This differs slightly from the standard CSR data structure, which stores all non-zero values in a single array and requires look-ups to see where each row starts and ends. These look-ups can slow down random access but should be efficient for retrieving contiguous rows.
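
For readers less familiar with the CSR layout described here, the following sketch prints the three underlying arrays for a tiny, made-up matrix and then splits the values per row, which is the rearrangement the indexer relies on. The example matrix is an illustration only.

```python
import numpy as np
from scipy import sparse

# Tiny hypothetical matrix; the second row is deliberately empty.
dense = np.array([[1, 0, 2],
                  [0, 0, 0],
                  [0, 3, 0],
                  [4, 5, 6]])
mat = sparse.csr_matrix(dense)

print(mat.data)     # all non-zero values, stored row after row
print(mat.indices)  # column index of each stored value
print(mat.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Per-row split: one small array per row, each with its own length.
row_values = [mat.data[start:end]
              for start, end in zip(mat.indptr[:-1], mat.indptr[1:])]
print([row.tolist() for row in row_values])
```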

Profiling results

My matrix mat is a 1,900,000x1,250,000 csr_matrix with 400,000,000 non-zero elements. ilocs is an array of 200,000 random row indices.

>>> %timeit mat[ilocs]
2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

compared to:

>>> row_indexer = SparseRowIndexer(mat)
>>> %timeit row_indexer[ilocs]
59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The SparseRowIndexer seems to be faster when using fancy indexing compared to boolean masks.
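
If a row selection starts out as a boolean mask, one way to get the integer indices used for fancy indexing is np.flatnonzero. The sketch below uses a plain csr_matrix and made-up sizes, and only checks that both selectors pick the same rows. It does not measure speed.

```python
import numpy as np
from scipy import sparse

# Hypothetical matrix and mask; shape, density and threshold are arbitrary.
mat = sparse.random(500, 300, density=0.02, format="csr", random_state=0)
rng = np.random.default_rng(0)
mask = rng.random(mat.shape[0]) < 0.1  # boolean mask, one entry per row

ilocs = np.flatnonzero(mask)  # integer positions of the True entries

by_mask = mat[mask]    # boolean-mask row selection
by_index = mat[ilocs]  # fancy (integer array) row selection

# Both selections contain the same rows in the same order.
assert np.array_equal(by_mask.toarray(), by_index.toarray())
print(by_index.shape)
```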

answered Oct 02 '22 16:10

Sorig
