Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

slicing sparse (scipy) matrix

I would appreciate any help, to understand following behavior when slicing a lil_matrix (A) from the scipy.sparse package.

Actually, I would like to extract a submatrix based on an arbitrary index list for both rows and columns.

When I used this two lines of code:

x1 = A[list 1,:]
x2 = x1[:,list 2]

Everything was fine and I could extract the right submatrix.

When I tried to do this in one line, it failed (The returning matrix was empty)

x=A[list 1,list 2]

Why is this so? Overall, I have used a similar command in matlab and there it works. So, why not use the first, since it works? It seems to be quite time consuming. Since I have to go through a large amount of entries, I would like to speed it up using a single command. Maybe I use the wrong sparse matrix type...Any idea?

like image 560
user972858 Avatar asked Sep 30 '11 10:09

user972858


People also ask

How do you slice a sparse matrix?

So we first convert the COO sparse matrix to CSR (Compressed Sparse Row format) matrix using tocsr() function. And then we can slice the sparse matrix rows using the row indices array we created. We can see that after slicing we get a sparse matrix of size 3×5 in CSR format.

What is a SciPy sparse matrix?

Python's SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. The sparse matrix representation outputs the row-column tuple where the matrix contains non-zero values along with those values.

How do you reduce sparse matrix in python?

The dimensionality of the sparse matrix can be reduced by first representing the dense matrix as a Compressed sparse row representation in which the sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.

How do you convert a matrix into a sparse matrix in python?

Create an empty list which will represent the sparse matrix list. Iterate through the 2D matrix to find non zero elements. If an element is non zero, create a temporary empty list. Append the row value, column value, and the non zero element itself into the temporary list.


2 Answers

The method you are already using,

A[list1, :][:, list2]

seems to be the fastest way to select the desired values from a spares matrix. See below for a benchmark.

However, to answer your question about how to select values from arbitrary rows and columns of A with a single index, you would need to use so-called "advanced indexing":

A[np.array(list1)[:,np.newaxis], np.array(list2)]

With advanced indexing, if arr1 and arr2 are NDarrays, the (i,j) component of A[arr1, arr2] equals

A[arr1[i,j], arr2[i,j]]

Thus you would want arr1[i,j] to equal list1[i] for all j, and arr2[i,j] to equal list2[j] for all i.

That can be arranged with the help of broadcasting (see below) by setting arr1 = np.array(list1)[:,np.newaxis], and arr2 = np.array(list2).

The shape of arr1 is (len(list1), 1) while the shape of arr2 is (len(list2), ) which broadcasts to (1, len(list2)) since new axes are added on the left automatically when needed.

Each array can be further broadcasted to shape (len(list1),len(list2)). This is exactly what we want for A[arr1[i,j],arr2[i,j]] to make sense, since we want (i,j) to run over all possible indices for a result array of shape (len(list1),len(list2)).


Here is a microbenchmark for one test case which suggests that A[list1, :][:, list2] is the fastest option:

In [32]: %timeit orig(A, list1, list2)
10 loops, best of 3: 110 ms per loop

In [34]: %timeit using_listener(A, list1, list2)
1 loop, best of 3: 1.29 s per loop

In [33]: %timeit using_advanced_indexing(A, list1, list2)
1 loop, best of 3: 1.8 s per loop

Here is the setup I used for the benchmark:

import numpy as np
import scipy.sparse as sparse
import random
random.seed(1)

def setup(N):
    A = sparse.rand(N, N, .1, format='lil')
    list1 = np.random.choice(N, size=N//10, replace=False).tolist()
    list2 = np.random.choice(N, size=N//20, replace=False).tolist()
    return A, list1, list2

def orig(A, list1, list2):
    return A[list1, :][:, list2]

def using_advanced_indexing(A, list1, list2):
    B = A.tocsc()  # or `.tocsr()`
    B = B[np.array(list1)[:, np.newaxis], np.array(list2)]
    return B

def using_listener(A, list1, list2):
    """https://stackoverflow.com/a/26592783/190597 (listener)"""
    B = A.tocsr()[list1, :].tocsc()[:, list2]
    return B

N = 10000
A, list1, list2 = setup(N)
B = orig(A, list1, list2)
C = using_advanced_indexing(A, list1, list2)
D = using_listener(A, list1, list2)
assert np.allclose(B.toarray(), C.toarray())
assert np.allclose(B.toarray(), D.toarray())
like image 85
unutbu Avatar answered Oct 21 '22 08:10

unutbu


for me the solution from unutbu works well, but is slow.

I found as a fast alternative,

A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]

You can see that row'S and col's get cut separately, but each one converted to the fastest sparse format, to get index this time.

In my test environment this code is 1000 times faster than the other one.

I hope, I don't tell something wrong or make a mistake.

like image 38
listener Avatar answered Oct 21 '22 08:10

listener