slicing sparse (scipy) matrix

Tags: python, slice, scipy, sparse-matrix, submatrix

I would appreciate any help understanding the following behavior when slicing a lil_matrix (A) from the scipy.sparse package.

I would like to extract a submatrix based on arbitrary index lists for both rows and columns.

When I used these two lines of code:

x1 = A[list1, :]
x2 = x1[:, list2]

Everything was fine and I could extract the right submatrix.

When I tried to do this in one line, it failed (the returned matrix was empty):

x = A[list1, list2]

Why is this so? I have used a similar command in MATLAB, and there it works. So why not just use the first version, since it works? Because it seems to be quite time-consuming. Since I have to go through a large number of entries, I would like to speed it up using a single command. Maybe I am using the wrong sparse matrix type... Any ideas?

asked Sep 30 '11 10:09

user972858

2 Answers

The method you are already using,

A[list1, :][:, list2]

seems to be the fastest way to select the desired values from a sparse matrix. See below for a benchmark.

However, to answer your question about how to select values from arbitrary rows and columns of A with a single index, you would need to use so-called "advanced indexing":

A[np.array(list1)[:,np.newaxis], np.array(list2)]

With advanced indexing, if arr1 and arr2 are NumPy arrays (ndarrays), the (i,j) component of A[arr1, arr2] equals

A[arr1[i,j], arr2[i,j]]

Thus you would want arr1[i,j] to equal list1[i] for all j, and arr2[i,j] to equal list2[j] for all i.
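To make the element-by-element pairing concrete, here is a minimal sketch on a dense NumPy array. The array `M` and the index lists `rows` and `cols` are made up for illustration and are not part of the question:

```python
import numpy as np

# Hypothetical 4x4 example array with M[r, c] == 10*r + c
M = np.arange(4)[:, np.newaxis] * 10 + np.arange(4)

rows = [0, 2]
cols = [1, 3]

# Two index lists of equal length are paired up: (0, 1) and (2, 3).
paired = M[rows, cols]

# A column of row indices against a flat array of column indices
# selects every (row, col) combination instead.
block = M[np.array(rows)[:, np.newaxis], np.array(cols)]

print(paired)  # the two paired elements
print(block)   # the 2x2 submatrix
```

Here `paired` holds only `M[0, 1]` and `M[2, 3]`, whereas `block` holds all four combinations of the selected rows and columns.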

That can be arranged with the help of broadcasting (see below) by setting arr1 = np.array(list1)[:,np.newaxis], and arr2 = np.array(list2).

The shape of arr1 is (len(list1), 1), while the shape of arr2 is (len(list2),), which broadcasts to (1, len(list2)) since new axes are added on the left automatically when needed.

Both arrays are then broadcast to the common shape (len(list1), len(list2)). This is exactly what we want for A[arr1[i,j], arr2[i,j]] to make sense, since we want (i,j) to run over all possible indices of a result array of shape (len(list1), len(list2)).
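The shape argument above can be checked directly. The following is a small sketch, assuming a 5x5 test matrix and two short index lists chosen only for illustration. It uses `np.broadcast_arrays` to show the broadcast index arrays and `np.ix_` on the dense copy as a reference result:

```python
import numpy as np
import scipy.sparse as sparse

list1 = [3, 0, 2]   # hypothetical row indices
list2 = [1, 4]      # hypothetical column indices

arr1 = np.array(list1)[:, np.newaxis]   # shape (3, 1)
arr2 = np.array(list2)                  # shape (2,)

# Broadcast both index arrays to the common shape (3, 2):
# b1[i, j] == list1[i] and b2[i, j] == list2[j].
b1, b2 = np.broadcast_arrays(arr1, arr2)

# Small test matrix whose entries are all distinct and nonzero
dense = np.arange(1, 26).reshape(5, 5)
A = sparse.csr_matrix(dense)

one_step = A[arr1, arr2]           # single advanced-indexing expression
two_step = A[list1, :][:, list2]   # rows first, then columns

# Reference result computed on the dense array
reference = dense[np.ix_(list1, list2)]
print(np.array_equal(one_step.toarray(), reference))
print(np.array_equal(two_step.toarray(), reference))
```

On a dense array, `np.ix_(list1, list2)` builds the same kind of broadcastable index pair, which is why it serves as the reference here.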

Here is a microbenchmark for one test case which suggests that A[list1, :][:, list2] is the fastest option:

In [32]: %timeit orig(A, list1, list2)
10 loops, best of 3: 110 ms per loop

In [34]: %timeit using_listener(A, list1, list2)
1 loop, best of 3: 1.29 s per loop

In [33]: %timeit using_advanced_indexing(A, list1, list2)
1 loop, best of 3: 1.8 s per loop

Here is the setup I used for the benchmark:

import numpy as np
import scipy.sparse as sparse
np.random.seed(1)  # seed NumPy's global RNG (used by np.random.choice below)

def setup(N):
    A = sparse.rand(N, N, .1, format='lil')
    list1 = np.random.choice(N, size=N//10, replace=False).tolist()
    list2 = np.random.choice(N, size=N//20, replace=False).tolist()
    return A, list1, list2

def orig(A, list1, list2):
    return A[list1, :][:, list2]

def using_advanced_indexing(A, list1, list2):
    B = A.tocsc()  # or `.tocsr()`
    B = B[np.array(list1)[:, np.newaxis], np.array(list2)]
    return B

def using_listener(A, list1, list2):
    """https://stackoverflow.com/a/26592783/190597 (listener)"""
    B = A.tocsr()[list1, :].tocsc()[:, list2]
    return B

N = 10000
A, list1, list2 = setup(N)
B = orig(A, list1, list2)
C = using_advanced_indexing(A, list1, list2)
D = using_listener(A, list1, list2)
assert np.allclose(B.toarray(), C.toarray())
assert np.allclose(B.toarray(), D.toarray())

answered Oct 21 '22 08:10

unutbu

For me, the solution from unutbu works well but is slow.

I found a fast alternative:

A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]

As you can see, the rows and columns are cut separately, but before each cut the matrix is converted to the sparse format that is fastest for that kind of indexing (CSR for rows, CSC for columns).

In my test environment this code is 1000 times faster than the other one.

I hope I haven't said anything wrong or made a mistake.
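Since the two answers report different timings, it may be worth measuring on your own SciPy version and matrix size. Below is a minimal sketch of such a check using the standard `timeit` module. The function names `two_step_lil` and `csr_then_csc`, the matrix size, and the density are assumptions chosen to keep the run short, and no particular outcome is implied:

```python
import timeit

import numpy as np
import scipy.sparse as sparse

rng = np.random.default_rng(0)
N = 500  # deliberately small; the benchmark in the first answer used N = 10000

# Build a LIL matrix with roughly 10% nonzero entries
dense = rng.random((N, N))
dense[dense > 0.1] = 0.0
A = sparse.lil_matrix(dense)

list1 = rng.choice(N, size=N // 10, replace=False).tolist()
list2 = rng.choice(N, size=N // 20, replace=False).tolist()


def two_step_lil(A, list1, list2):
    """Index rows, then columns, staying in LIL format."""
    return A[list1, :][:, list2]


def csr_then_csc(A, list1, list2):
    """Convert to CSR for the row cut and to CSC for the column cut."""
    return A.tocsr()[np.array(list1), :].tocsc()[:, np.array(list2)]


# Both variants should agree with plain dense indexing
expected = dense[np.ix_(list1, list2)]
same = (
    np.array_equal(two_step_lil(A, list1, list2).toarray(), expected)
    and np.array_equal(csr_then_csc(A, list1, list2).toarray(), expected)
)
print("results agree:", same)

repeats = 5
for fn in (two_step_lil, csr_then_csc):
    seconds = timeit.timeit(lambda: fn(A, list1, list2), number=repeats)
    print(f"{fn.__name__}: {seconds / repeats:.4f} s per call")
```

The conversion cost of `tocsr()` and `tocsc()` is included in the timing on purpose, because the starting point in the question is a lil_matrix.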

answered Oct 21 '22 08:10

listener
