Efficiently accumulating a collection of sparse scipy matrices

I've got a collection of O(N) NxN scipy.sparse.csr_matrix, and each sparse matrix has on the order of N elements set. I want to add all these matrices together to get a regular NxN numpy array. (N is on the order of 1000). The arrangement of non-zero elements within the matrices is such that the resulting sum certainly isn't sparse (virtually no zero elements left in fact).

At the moment I'm just doing

reduce(lambda x,y: x+y,[m.toarray() for m in my_sparse_matrices])

which works but is a bit slow: the sheer amount of pointless processing of zeros going on there is horrific.

Is there a better way? There's nothing obvious to me in the docs.

Update: as per user545424's suggestion, I tried the alternative schemes of summing the sparse matrices before converting, and of summing the sparse matrices onto a dense matrix. The code below shows that all the approaches run in comparable time (Python 2.6.6 on amd64 Debian/Squeeze on a quad-core i7):

import numpy as np
import numpy.random
import scipy
import scipy.sparse
import time

N=768   # number of sparse matrices to sum
S=768   # each matrix is SxS
D=3     # ~D*S nonzero elements per matrix (random positions may overlap)

def mkrandomsparse():
    # Build a dense SxS matrix with ~D*S randomly placed ones,
    # then convert it to CSR form.
    m=np.zeros((S,S),dtype=np.float32)
    r=np.random.random_integers(0,S-1,D*S)
    c=np.random.random_integers(0,S-1,D*S)
    for e in zip(r,c):
        m[e[0],e[1]]=1.0
    return scipy.sparse.csr_matrix(m)

M=[mkrandomsparse() for i in xrange(N)]

def plus_dense():   # convert each matrix to dense, then pairwise add
    return reduce(lambda x,y: x+y,[m.toarray() for m in M])

def plus_sparse():  # pairwise add in sparse form, densify once at the end
    return reduce(lambda x,y: x+y,M).toarray()

def sum_dense():    # built-in sum over the dense arrays
    return sum([m.toarray() for m in M])

def sum_sparse():   # built-in sum over the sparse matrices
    return sum(M[1:],M[0]).toarray()

def sum_combo():    # sum the sparse matrices 'onto' a dense zero matrix
    return sum(M,np.zeros((S,S),dtype=np.float32))

def benchmark(fn):
    t0=time.time()
    fn()
    t1=time.time()
    print "{0:16}:  {1:.3f}s".format(fn.__name__,t1-t0)

for i in xrange(4):
    benchmark(plus_dense)
    benchmark(plus_sparse)
    benchmark(sum_dense)
    benchmark(sum_sparse)
    benchmark(sum_combo)
    print

and outputs:

plus_dense      :  1.368s
plus_sparse     :  1.405s
sum_dense       :  1.368s
sum_sparse      :  1.406s
sum_combo       :  1.039s

You can get one approach or the other to come out ahead by a factor of 2 or so by tweaking the N, S and D parameters, but nothing like the order-of-magnitude improvement you'd hope for considering the number of zero adds it should be possible to skip.
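For comparison, here is a sketch (not from the original post) of one way to skip the zero adds entirely: concatenate the COO triplets of all the matrices into one combined sparse matrix and let the CSR conversion sum the duplicate (row, col) entries. The function name sum_via_coo and its arguments are illustrative, not part of any library API.

import numpy as np
import scipy.sparse

def sum_via_coo(matrices, shape, dtype=np.float32):
    # Gather the (row, col, value) triplets of every matrix...
    coos = [m.tocoo() for m in matrices]
    rows = np.concatenate([c.row for c in coos])
    cols = np.concatenate([c.col for c in coos])
    data = np.concatenate([c.data for c in coos]).astype(dtype)
    # ...and build one combined COO matrix; converting it to CSR sums
    # duplicate (row, col) entries, so only stored nonzeros are ever touched.
    combined = scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape)
    return combined.tocsr().toarray()

Whether this beats the approaches above depends on how dense the result is: once nearly every element of the sum is nonzero, the final toarray() call dominates either way.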

asked Jun 28 '12 by timday



1 Answer

I think I've found a way to speed it up by a factor of ~10 if your matrices are very sparse.

In [1]: import numpy as np
   ...: from scipy.sparse import csr_matrix

In [2]: def sum_sparse(m):
   ...:     x = np.zeros(m[0].shape)
   ...:     for a in m:
   ...:         # np.diff(a.indptr) gives the number of stored values in each
   ...:         # row, so ri is the row index of every stored value in order.
   ...:         ri = np.repeat(np.arange(a.shape[0]),np.diff(a.indptr))
   ...:         # Add only the stored nonzeros into the dense accumulator.
   ...:         x[ri,a.indices] += a.data
   ...:     return x
   ...: 

In [6]: m = [np.zeros((100,100)) for i in range(1000)]

In [7]: for x in m:
   ...:     x.ravel()[np.random.randint(0,x.size,10)] = 1.0
   ...: 

In [8]: m = [csr_matrix(x) for x in m]

In [17]: (sum(m[1:],m[0]).todense() == sum_sparse(m)).all()
Out[17]: True

In [18]: %timeit sum(m[1:],m[0]).todense()
10 loops, best of 3: 145 ms per loop

In [19]: %timeit sum_sparse(m)
100 loops, best of 3: 18.5 ms per loop
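
The trick above touches only the stored nonzeros of each matrix, which is where the speedup comes from. Below is a sketch (not part of the original answer) of the same idea written with np.add.at, which also accumulates correctly if a matrix carries duplicate (row, col) entries; the name sum_sparse_add_at is illustrative.

import numpy as np

def sum_sparse_add_at(matrices):
    # Dense accumulator with the shape of the first matrix.
    out = np.zeros(matrices[0].shape)
    for a in matrices:
        c = a.tocoo()  # COO exposes explicit row/col arrays per stored value
        # Unbuffered in-place add: repeated (row, col) pairs all accumulate.
        np.add.at(out, (c.row, c.col), c.data)
    return out

np.add.at is typically slower than the plain fancy-indexed += used above because it has to handle repeated indices, but for canonical CSR input the results are identical.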
answered Sep 18 '22 by user545424