Efficiently accumulating a collection of sparse scipy matrices

I've got a collection of O(N) NxN scipy.sparse.csr_matrix, and each sparse matrix has on the order of N elements set. I want to add all these matrices together to get a regular NxN numpy array. (N is on the order of 1000). The arrangement of non-zero elements within the matrices is such that the resulting sum certainly isn't sparse (virtually no zero elements left in fact).

At the moment I'm just doing

reduce(lambda x,y: x+y,[m.toarray() for m in my_sparse_matrices])

which works but is a bit slow: the sheer amount of pointless processing of zeros going on there is horrific.

Is there a better way? There's nothing obvious to me in the docs.

Update: as per user545424's suggestion, I tried the alternative schemes of summing the sparse matrices before converting, and of summing the sparse matrices onto a dense matrix. The code below shows that all the approaches run in comparable time (Python 2.6.6 on amd64 Debian/Squeeze on a quad-core i7):

import numpy as np
import numpy.random
import scipy
import scipy.sparse
import time

N=768   # number of sparse matrices to sum
S=768   # each matrix is SxS
D=3     # ~D*S nonzero elements per matrix (random positions may overlap)

def mkrandomsparse():
    # Build a dense SxS matrix with ~D*S randomly placed ones,
    # then convert it to CSR form.
    m=np.zeros((S,S),dtype=np.float32)
    r=np.random.random_integers(0,S-1,D*S)
    c=np.random.random_integers(0,S-1,D*S)
    for e in zip(r,c):
        m[e[0],e[1]]=1.0
    return scipy.sparse.csr_matrix(m)

M=[mkrandomsparse() for i in xrange(N)]

def plus_dense():   # convert each matrix to dense, then pairwise add
    return reduce(lambda x,y: x+y,[m.toarray() for m in M])

def plus_sparse():  # pairwise add in sparse form, densify once at the end
    return reduce(lambda x,y: x+y,M).toarray()

def sum_dense():    # built-in sum over the dense arrays
    return sum([m.toarray() for m in M])

def sum_sparse():   # built-in sum over the sparse matrices
    return sum(M[1:],M[0]).toarray()

def sum_combo():    # sum the sparse matrices 'onto' a dense zero matrix
    return sum(M,np.zeros((S,S),dtype=np.float32))

def benchmark(fn):
    t0=time.time()
    fn()
    t1=time.time()
    print "{0:16}:  {1:.3f}s".format(fn.__name__,t1-t0)

for i in xrange(4):
    benchmark(plus_dense)
    benchmark(plus_sparse)
    benchmark(sum_dense)
    benchmark(sum_sparse)
    benchmark(sum_combo)
    print

and outputs:

plus_dense      :  1.368s
plus_sparse     :  1.405s
sum_dense       :  1.368s
sum_sparse      :  1.406s
sum_combo       :  1.039s

You can get one approach or the other to come out ahead by a factor of 2 or so by tweaking the N, S and D parameters, but nothing like the order-of-magnitude improvement you'd hope for considering the number of zero adds it should be possible to skip.
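For comparison, here is a sketch (not from the original post) of one way to skip the zero adds entirely: concatenate the COO triplets of all the matrices into one combined sparse matrix and let the CSR conversion sum the duplicate (row, col) entries. The function name sum_via_coo and its arguments are illustrative, not part of any library API.

import numpy as np
import scipy.sparse

def sum_via_coo(matrices, shape, dtype=np.float32):
    # Gather the (row, col, value) triplets of every matrix...
    coos = [m.tocoo() for m in matrices]
    rows = np.concatenate([c.row for c in coos])
    cols = np.concatenate([c.col for c in coos])
    data = np.concatenate([c.data for c in coos]).astype(dtype)
    # ...and build one combined COO matrix; converting it to CSR sums
    # duplicate (row, col) entries, so only stored nonzeros are ever touched.
    combined = scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape)
    return combined.tocsr().toarray()

Whether this beats the approaches above depends on how dense the result is: once nearly every element of the sum is nonzero, the final toarray() call dominates either way.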

asked Jun 28 '12 by timday



1 Answer

I think I've found a way to speed it up by a factor of ~10 if your matrices are very sparse.

In [1]: import numpy as np
   ...: from scipy.sparse import csr_matrix

In [2]: def sum_sparse(m):
   ...:     x = np.zeros(m[0].shape)
   ...:     for a in m:
   ...:         # np.diff(a.indptr) gives the number of stored values in each
   ...:         # row, so ri is the row index of every stored value in order.
   ...:         ri = np.repeat(np.arange(a.shape[0]),np.diff(a.indptr))
   ...:         # Add only the stored nonzeros into the dense accumulator.
   ...:         x[ri,a.indices] += a.data
   ...:     return x
   ...: 

In [6]: m = [np.zeros((100,100)) for i in range(1000)]

In [7]: for x in m:
   ...:     x.ravel()[np.random.randint(0,x.size,10)] = 1.0
   ...: 

In [8]: m = [csr_matrix(x) for x in m]

In [17]: (sum(m[1:],m[0]).todense() == sum_sparse(m)).all()
Out[17]: True

In [18]: %timeit sum(m[1:],m[0]).todense()
10 loops, best of 3: 145 ms per loop

In [19]: %timeit sum_sparse(m)
100 loops, best of 3: 18.5 ms per loop
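
The trick above touches only the stored nonzeros of each matrix, which is where the speedup comes from. Below is a sketch (not part of the original answer) of the same idea written with np.add.at, which also accumulates correctly if a matrix carries duplicate (row, col) entries; the name sum_sparse_add_at is illustrative.

import numpy as np

def sum_sparse_add_at(matrices):
    # Dense accumulator with the shape of the first matrix.
    out = np.zeros(matrices[0].shape)
    for a in matrices:
        c = a.tocoo()  # COO exposes explicit row/col arrays per stored value
        # Unbuffered in-place add: repeated (row, col) pairs all accumulate.
        np.add.at(out, (c.row, c.col), c.data)
    return out

np.add.at is typically slower than the plain fancy-indexed += used above because it has to handle repeated indices, but for canonical CSR input the results are identical.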
answered Sep 18 '22 by user545424