What is the fastest or, failing that, least wordy way of accessing all non-zero values in a row row
or column col
of a scipy.sparse
matrix A
in CSR
format?
Would doing it in another format (say, COO
) be more efficient?
Right now, I use the following:
A[row, A[row, :].nonzero()[1]]
or
A[A[:, col].nonzero()[0], col]
Sparse matrices ( scipy.sparse ) Sparse linear algebra ( scipy.sparse.linalg ) Compressed sparse graph routines ( scipy.sparse.csgraph )
Many linear algebra NumPy and SciPy functions that operate on NumPy arrays can transparently operate on SciPy sparse arrays. Further, machine learning libraries that use NumPy data structures can also operate transparently on SciPy sparse arrays, such as scikit-learn for general machine learning and Keras for deep learning.
Sparse Matrix Representations can be done in many ways following are two common representations: Array representation. Linked list representation. Method 1: Using Arrays: 2D array is used to represent a sparse matrix in which there are three rows named as. Row: Index of row, where non-zero element is located.
In both cases, the matrix contained is sparse with many more zero values than data values. The problem with representing these sparse matrices as dense matrices is that memory is required and must be allocated for each 32-bit or even 64-bit zero value in the matrix.
For a problem like this is pays to understand the underlying data structures for the different formats:
In [672]: A=sparse.csr_matrix(np.arange(24).reshape(4,6))
In [673]: A.data
Out[673]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23], dtype=int32)
In [674]: A.indices
Out[674]: array([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
In [675]: A.indptr
Out[675]: array([ 0, 5, 11, 17, 23], dtype=int32)
The data
values for a row are a slice within A.data
, but identifying that slice requires some knowledge of the A.indptr
(see below)
For the coo
.
In [676]: Ac=A.tocoo()
In [677]: Ac.data
Out[677]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23], dtype=int32)
In [678]: Ac.row
Out[678]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], dtype=int32)
In [679]: Ac.col
Out[679]: array([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
Note that A.nonzeros()
converts to coo
and returns the row
and col
attributes (more or less - look at its code).
For the lil
format; data is stored by row in lists:
In [680]: Al=A.tolil()
In [681]: Al.data
Out[681]:
array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]], dtype=object)
In [682]: Al.rows
Out[682]:
array([[1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5]], dtype=object)
===============
Selecting a row of A
works, though in my experience that tends to be a bit slow, in part because it has to create a new csr
matrix. Also your expression seems wordier than needed.
Looking at my first row which has a 0 element (the others are too dense):
In [691]: A[0, A[0,:].nonzero()[1]].A
Out[691]: array([[1, 2, 3, 4, 5]], dtype=int32)
The whole row, expressed as a dense array is:
In [692]: A[0,:].A
Out[692]: array([[0, 1, 2, 3, 4, 5]], dtype=int32)
but the data
attribute of that row is the same as your selection
In [693]: A[0,:].data
Out[693]: array([1, 2, 3, 4, 5], dtype=int32)
and with the lil
format
In [694]: Al.data[0]
Out[694]: [1, 2, 3, 4, 5]
A[0,:].tocoo()
doesn't add anything.
Direct access to attributes of a csr
and lil
isn't that good when picking columns. For that csc
is better, or lil
of the transpose.
Direct access to the csr
data
, with the aid of indptr
, would be:
In [697]: i=0; A.data[A.indptr[i]:A.indptr[i+1]]
Out[697]: array([1, 2, 3, 4, 5], dtype=int32)
Calculations using the csr
format routinely iterate through indptr
like this, getting the values of each row - but they do this in compiled code.
A recent related topic, seeking the product of nonzero elements by row: Multiplying column elements of sparse Matrix
There I found the reduceat
using indptr
was quite fast.
Another tool when dealing with sparse matrices is multiplication
In [708]: (sparse.csr_matrix(np.array([1,0,0,0])[None,:])*A)
Out[708]:
<1x6 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in Compressed Sparse Row format>
csr
actually does sum
with this kind of multiplication. And if my memory is correct, it actually performs A[0,:]
this way
Sparse matrix slicing using list of int
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With