I have a sparse matrix in csr_matrix format. For each row i need to subtract row mean from the nonzero elements. The means must be computed on the number of the nonzero elements of the row (instead of the length of the row). I found a fast way to compute the row means with the following code:
# M is a csr_matrix
sums = np.squeeze(np.asarray(M.sum(1))) # sum of the nonzero elements, for each row
counts = np.diff(M.tocsr().indptr) # count of the nonzero elements, for each row
# for the i-th row the mean is just sums[i] / float(counts[i])
The problem is the updates part. I need a fast way to do this. Actually what i am doing is to transform M to a lil_matrix and perform the updates in this way:
M = M.tolil()
for i in xrange(len(sums)):
for j in M.getrow(i).nonzero()[1]:
M[i, j] -= sums[i] / float(counts[i])
which is slow. Any suggestion for a faster solution?
This one is tricky. I think I have it. The basic idea is that we try to get a diagonal matrix with the means on the diagonal, and a matrix that is like M, but has ones at the nonzero data locations in M. Then we multiply those and subtract the product from M. Here goes...
>>> import numpy as np
>>> import scipy.sparse as sp
>>> a = sp.csr_matrix([[1., 0., 2.], [1.,2.,3.]])
>>> a.todense()
matrix([[ 1., 0., 2.],
[ 1., 2., 3.]])
>>> tot = np.array(a.sum(axis=1).squeeze())[0]
>>> tot
array([ 3., 6.])
>>> cts = np.diff(a.indptr)
>>> cts
array([2, 3], dtype=int32)
>>> mu = tot/cts
>>> mu
array([ 1.5, 2. ])
>>> d = sp.diags(mu, 0)
>>> d.todense()
matrix([[ 1.5, 0. ],
[ 0. , 2. ]])
>>> b = a.copy()
>>> b.data = np.ones_like(b.data)
>>> b.todense()
matrix([[ 1., 0., 1.],
[ 1., 1., 1.]])
>>> (d * b).todense()
matrix([[ 1.5, 0. , 1.5],
[ 2. , 2. , 2. ]])
>>> (a - d*b).todense()
matrix([[-0.5, 0. , 0.5],
[-1. , 0. , 1. ]])
Good Luck! Hope that helps.
Starting with @Dthal's
sample:
In [92]: a = sparse.csr_matrix([[1.,0,2],[1,2,3]])
In [93]: a.A
Out[93]:
array([[ 1., 0., 2.],
[ 1., 2., 3.]])
In [94]: sums=np.squeeze(a.sum(1).A)
# sums=a.sum(1).A1 # shortcut
In [95]: counts=np.diff(a.tocsr().indptr)
In [96]: means=sums/counts
In [97]: sums
Out[97]: array([ 3., 6.])
In [98]: counts
Out[98]: array([2, 3], dtype=int32)
In [99]: means
Out[99]: array([ 1.5, 2. ])
repeat
lets us replicate the means
, producing an array that matches the matrix data
in size.
In [100]: mc = np.repeat(means, counts)
In [101]: mc
Out[101]: array([ 1.5, 1.5, 2. , 2. , 2. ])
This mc
is the same as @Dthal's
(b*d).data
.
Now just subtract it from data
.
In [102]: a.data -= mc
In [103]: a.A
Out[103]:
array([[-0.5, 0. , 0.5],
[-1. , 0. , 1. ]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With