Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scipy.sparse matrix: subtract row mean to nonzero elements

I have a sparse matrix in csr_matrix format. For each row i need to subtract row mean from the nonzero elements. The means must be computed on the number of the nonzero elements of the row (instead of the length of the row). I found a fast way to compute the row means with the following code:

# M is a csr_matrix
sums = np.squeeze(np.asarray(M.sum(1)))    # sum of the nonzero elements, for each row
counts = np.diff(M.tocsr().indptr)         # count of the nonzero elements, for each row


# for the i-th row the mean is just sums[i] / float(counts[i]) 

The problem is the updates part. I need a fast way to do this. Actually what i am doing is to transform M to a lil_matrix and perform the updates in this way:

M = M.tolil()

for i in xrange(len(sums)):
    for j in M.getrow(i).nonzero()[1]:
        M[i, j] -= sums[i] / float(counts[i])

which is slow. Any suggestion for a faster solution?

like image 981
revy Avatar asked Sep 25 '16 09:09

revy


2 Answers

This one is tricky. I think I have it. The basic idea is that we try to get a diagonal matrix with the means on the diagonal, and a matrix that is like M, but has ones at the nonzero data locations in M. Then we multiply those and subtract the product from M. Here goes...

>>> import numpy as np
>>> import scipy.sparse as sp
>>> a = sp.csr_matrix([[1., 0., 2.], [1.,2.,3.]])
>>> a.todense()
matrix([[ 1.,  0.,  2.],
        [ 1.,  2.,  3.]])
>>> tot = np.array(a.sum(axis=1).squeeze())[0]
>>> tot
array([ 3.,  6.])
>>> cts = np.diff(a.indptr)
>>> cts
array([2, 3], dtype=int32)
>>> mu = tot/cts
>>> mu
array([ 1.5,  2. ])
>>> d = sp.diags(mu, 0)
>>> d.todense()
matrix([[ 1.5,  0. ],
        [ 0. ,  2. ]])
>>> b = a.copy()
>>> b.data = np.ones_like(b.data)
>>> b.todense()
matrix([[ 1.,  0.,  1.],
        [ 1.,  1.,  1.]])
>>> (d * b).todense()
matrix([[ 1.5,  0. ,  1.5],
        [ 2. ,  2. ,  2. ]])
>>> (a - d*b).todense()
matrix([[-0.5,  0. ,  0.5],
        [-1. ,  0. ,  1. ]])

Good Luck! Hope that helps.

like image 119
Dthal Avatar answered Oct 04 '22 12:10

Dthal


Starting with @Dthal's sample:

In [92]: a = sparse.csr_matrix([[1.,0,2],[1,2,3]])
In [93]: a.A
Out[93]: 
array([[ 1.,  0.,  2.],
       [ 1.,  2.,  3.]])

In [94]: sums=np.squeeze(a.sum(1).A)
# sums=a.sum(1).A1   # shortcut
In [95]: counts=np.diff(a.tocsr().indptr)
In [96]: means=sums/counts
In [97]: sums
Out[97]: array([ 3.,  6.])
In [98]: counts
Out[98]: array([2, 3], dtype=int32)
In [99]: means
Out[99]: array([ 1.5,  2. ])

repeat lets us replicate the means, producing an array that matches the matrix data in size.

In [100]: mc = np.repeat(means, counts)
In [101]: mc
Out[101]: array([ 1.5,  1.5,  2. ,  2. ,  2. ])

This mc is the same as @Dthal's (b*d).data.

Now just subtract it from data.

In [102]: a.data -= mc
In [103]: a.A
Out[103]: 
array([[-0.5,  0. ,  0.5],
       [-1. ,  0. ,  1. ]])
like image 38
hpaulj Avatar answered Oct 04 '22 10:10

hpaulj