scipy.sparse matrix: subtract row mean to nonzero elements

Question

I have a sparse matrix in csr_matrix format. For each row i need to subtract row mean from the nonzero elements. The means must be computed on the number of the nonzero elements of the row (instead of the length of the row). I found a fast way to compute the row means with the following code:

# M is a csr_matrix
sums = np.squeeze(np.asarray(M.sum(1)))    # sum of the nonzero elements, for each row
counts = np.diff(M.tocsr().indptr)         # count of the nonzero elements, for each row


# for the i-th row the mean is just sums[i] / float(counts[i])

The problem is the updates part. I need a fast way to do this. Actually what i am doing is to transform M to a lil_matrix and perform the updates in this way:

M = M.tolil()

for i in xrange(len(sums)):
    for j in M.getrow(i).nonzero()[1]:
        M[i, j] -= sums[i] / float(counts[i])

which is slow. Any suggestion for a faster solution?

Dthal · Accepted Answer

This one is tricky. I think I have it. The basic idea is that we try to get a diagonal matrix with the means on the diagonal, and a matrix that is like M, but has ones at the nonzero data locations in M. Then we multiply those and subtract the product from M. Here goes...

>>> import numpy as np
>>> import scipy.sparse as sp
>>> a = sp.csr_matrix([[1., 0., 2.], [1.,2.,3.]])
>>> a.todense()
matrix([[ 1.,  0.,  2.],
        [ 1.,  2.,  3.]])
>>> tot = np.array(a.sum(axis=1).squeeze())[0]
>>> tot
array([ 3.,  6.])
>>> cts = np.diff(a.indptr)
>>> cts
array([2, 3], dtype=int32)
>>> mu = tot/cts
>>> mu
array([ 1.5,  2. ])
>>> d = sp.diags(mu, 0)
>>> d.todense()
matrix([[ 1.5,  0. ],
        [ 0. ,  2. ]])
>>> b = a.copy()
>>> b.data = np.ones_like(b.data)
>>> b.todense()
matrix([[ 1.,  0.,  1.],
        [ 1.,  1.,  1.]])
>>> (d * b).todense()
matrix([[ 1.5,  0. ,  1.5],
        [ 2. ,  2. ,  2. ]])
>>> (a - d*b).todense()
matrix([[-0.5,  0. ,  0.5],
        [-1. ,  0. ,  1. ]])

Good Luck! Hope that helps.

hpaulj · Answer

Starting with @Dthal's sample:

In [92]: a = sparse.csr_matrix([[1.,0,2],[1,2,3]])
In [93]: a.A
Out[93]: 
array([[ 1.,  0.,  2.],
       [ 1.,  2.,  3.]])

In [94]: sums=np.squeeze(a.sum(1).A)
# sums=a.sum(1).A1   # shortcut
In [95]: counts=np.diff(a.tocsr().indptr)
In [96]: means=sums/counts
In [97]: sums
Out[97]: array([ 3.,  6.])
In [98]: counts
Out[98]: array([2, 3], dtype=int32)
In [99]: means
Out[99]: array([ 1.5,  2. ])

repeat lets us replicate the means, producing an array that matches the matrix data in size.

In [100]: mc = np.repeat(means, counts)
In [101]: mc
Out[101]: array([ 1.5,  1.5,  2. ,  2. ,  2. ])

This mc is the same as @Dthal's (b*d).data.

Now just subtract it from data.

In [102]: a.data -= mc
In [103]: a.A
Out[103]: 
array([[-0.5,  0. ,  0.5],
       [-1. ,  0. ,  1. ]])

scipy.sparse matrix: subtract row mean to nonzero elements

Tags:

python

scipy

updates

mean

sparse-matrix

revy

2 Answers

Dthal

hpaulj

Recent Activity

Donate For Us

scipy.sparse matrix: subtract row mean to nonzero elements

Tags:

python

scipy

updates

mean

sparse-matrix

revy

2 Answers

Dthal

hpaulj

Related questions

Recent Activity

Donate For Us