Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way of accessing non-zero values in row/column in scipy.sparse matrix

What is the fastest or, failing that, least wordy way of accessing all non-zero values in a row row or column col of a scipy.sparse matrix A in CSR format?

Would doing it in another format (say, COO) be more efficient?

Right now, I use the following:

A[row, A[row, :].nonzero()[1]]

or

A[A[:, col].nonzero()[0], col]
like image 959
musically_ut Avatar asked Dec 05 '16 22:12

musically_ut


People also ask

What are the different types of sparse matrices in SciPy?

Sparse matrices ( scipy.sparse ) Sparse linear algebra ( scipy.sparse.linalg ) Compressed sparse graph routines ( scipy.sparse.csgraph )

Which NumPy functions work on SciPy sparse arrays?

Many linear algebra NumPy and SciPy functions that operate on NumPy arrays can transparently operate on SciPy sparse arrays. Further, machine learning libraries that use NumPy data structures can also operate transparently on SciPy sparse arrays, such as scikit-learn for general machine learning and Keras for deep learning.

How do you represent a sparse matrix in Python?

Sparse Matrix Representations can be done in many ways following are two common representations: Array representation. Linked list representation. Method 1: Using Arrays: 2D array is used to represent a sparse matrix in which there are three rows named as. Row: Index of row, where non-zero element is located.

What is the problem with sparse matrices?

In both cases, the matrix contained is sparse with many more zero values than data values. The problem with representing these sparse matrices as dense matrices is that memory is required and must be allocated for each 32-bit or even 64-bit zero value in the matrix.


1 Answers

For a problem like this is pays to understand the underlying data structures for the different formats:

In [672]: A=sparse.csr_matrix(np.arange(24).reshape(4,6))
In [673]: A.data
Out[673]: 
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23], dtype=int32)
In [674]: A.indices
Out[674]: array([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
In [675]: A.indptr
Out[675]: array([ 0,  5, 11, 17, 23], dtype=int32)

The data values for a row are a slice within A.data, but identifying that slice requires some knowledge of the A.indptr (see below)

For the coo.

In [676]: Ac=A.tocoo()
In [677]: Ac.data
Out[677]: 
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23], dtype=int32)
In [678]: Ac.row
Out[678]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], dtype=int32)
In [679]: Ac.col
Out[679]: array([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)

Note that A.nonzeros() converts to coo and returns the row and col attributes (more or less - look at its code).

For the lil format; data is stored by row in lists:

In [680]: Al=A.tolil()
In [681]: Al.data
Out[681]: 
array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]], dtype=object)
In [682]: Al.rows
Out[682]: 
array([[1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5]], dtype=object)

===============

Selecting a row of A works, though in my experience that tends to be a bit slow, in part because it has to create a new csr matrix. Also your expression seems wordier than needed.

Looking at my first row which has a 0 element (the others are too dense):

In [691]: A[0, A[0,:].nonzero()[1]].A
Out[691]: array([[1, 2, 3, 4, 5]], dtype=int32)

The whole row, expressed as a dense array is:

In [692]: A[0,:].A
Out[692]: array([[0, 1, 2, 3, 4, 5]], dtype=int32)

but the data attribute of that row is the same as your selection

In [693]: A[0,:].data
Out[693]: array([1, 2, 3, 4, 5], dtype=int32)

and with the lil format

In [694]: Al.data[0]
Out[694]: [1, 2, 3, 4, 5]

A[0,:].tocoo() doesn't add anything.

Direct access to attributes of a csr and lil isn't that good when picking columns. For that csc is better, or lil of the transpose.

Direct access to the csr data, with the aid of indptr, would be:

In [697]: i=0; A.data[A.indptr[i]:A.indptr[i+1]]
Out[697]: array([1, 2, 3, 4, 5], dtype=int32)

Calculations using the csr format routinely iterate through indptr like this, getting the values of each row - but they do this in compiled code.

A recent related topic, seeking the product of nonzero elements by row: Multiplying column elements of sparse Matrix

There I found the reduceat using indptr was quite fast.

Another tool when dealing with sparse matrices is multiplication

In [708]: (sparse.csr_matrix(np.array([1,0,0,0])[None,:])*A)
Out[708]: 
<1x6 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>

csr actually does sum with this kind of multiplication. And if my memory is correct, it actually performs A[0,:] this way

Sparse matrix slicing using list of int

like image 173
hpaulj Avatar answered Nov 01 '22 12:11

hpaulj