I want to iterate over the rows of a CSR Matrix and divide each element by the sum of the row, similar to this here:
numpy divide row by row sum
My problem is that I'm dealing with a large matrix: (96582, 350138)
And when applying the operation from the linked post, it bloats up my memory, since the returned matrix is dense.
So here is my first try:
for row in counts:
    row = row / row.sum()
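(For context, here is a small reproduction of why this happens, as far as I can tell: iterating over a csr_matrix yields new one-row matrices, so reassigning `row` only rebinds the loop variable and never touches the original. The matrix values below are just an illustrative example, not my real data.)

```python
import numpy as np
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[1., 0., 2.],
                              [0., 0., 3.],
                              [4., 5., 6.]]))

for row in counts:
    # `row` is a brand-new 1x3 sparse matrix, not a view into `counts`
    row = row / row.sum()  # rebinds the local name only; `counts` is unchanged

print(counts.todense())  # still the original values
```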
Unfortunately this doesn't affect the matrix at all, so I came up with the second idea to create a new csr matrix and concat the rows using vstack:
from scipy import sparse
import time

start_time = curr_time = time.time()
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))
This works well, but after some iterations it gets slower and slower:
step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97
Any suggestions? Any idea why vstack gets slower and slower?
vstack is an O(n) operation because it needs to allocate memory for the result and then copy the contents of all arrays you passed as arguments into the result array. Calling it once per row therefore makes your loop quadratic overall: on each iteration the entire matrix built so far is copied again, which is why each step takes longer than the last.
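If you do want to build the result by stacking, a common workaround (my sketch, not part of the original approach) is to collect the rows in a Python list and call vstack once at the end, so every row is copied only once. The small example matrix below is assumed for illustration:

```python
import numpy as np
from scipy import sparse

counts = sparse.csr_matrix(np.array([[1., 0., 2.],
                                     [0., 0., 3.],
                                     [4., 5., 6.]]))

rows = []
for row in counts:               # each `row` is a 1xN sparse matrix
    rows.append(row / row.sum())  # dividing by a scalar keeps it sparse

mtx = sparse.vstack(rows)        # one copy at the end instead of one per iteration
print(mtx.todense())
```

This keeps the per-iteration cost constant, so the "slower and slower" effect disappears.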
You can simply use multiply to do the operation:
>>> res = counts.multiply(1 / counts.sum(1)) # multiply with inverse
>>> res.todense()
matrix([[ 0.33333333, 0. , 0.66666667],
[ 0. , 0. , 1. ],
[ 0.26666667, 0.33333333, 0.4 ]])
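One caveat (my addition, not part of the original answer): if any row sums to zero, 1 / counts.sum(1) produces inf. A possible guard is to replace zero sums before dividing, so empty rows simply stay empty:

```python
import numpy as np
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[1., 0., 2.],
                              [0., 0., 0.],   # empty row: sum is 0
                              [4., 5., 6.]]))

row_sums = np.asarray(counts.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0  # avoid division by zero; empty rows are unaffected
res = counts.multiply(1.0 / row_sums[:, np.newaxis])

print(res.todense())
```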
But it's also quite easy to use np.lib.stride_tricks.as_strided to do the operation you want (and it's relatively performant). This as_strided function also allows you to do more complicated operations on the array (if there is no method or function for your case).
For example using an example csr of the scipy documentation:
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> row = np.array([0,0,1,2,2,2])
>>> col = np.array([0,2,2,0,1,2])
>>> data = np.array([1.,2,3,4,5,6])
>>> counts = csr_matrix( (data,(row,col)), shape=(3,3) )
>>> counts.todense()
matrix([[ 1., 0., 2.],
[ 0., 0., 3.],
[ 4., 5., 6.]])
You can divide each row by its sum like this:
>>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
...                                                  shape=(counts.shape[0], 2),
...                                                  strides=2*counts.indptr.strides)
>>> for start, stop in row_start_stop:
...     row = counts.data[start:stop]
...     row /= row.sum()
>>> counts.todense()
matrix([[ 0.33333333, 0. , 0.66666667],
[ 0. , 0. , 1. ],
[ 0.26666667, 0.33333333, 0.4 ]])
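The same in-place normalization can also be written without as_strided, by pairing consecutive indptr entries directly (an equivalent sketch of the technique above, using the same example matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2, 3, 4, 5, 6])
counts = csr_matrix((data, (row, col)), shape=(3, 3))

# indptr[i]:indptr[i+1] delimits the stored values of row i in counts.data
for start, stop in zip(counts.indptr[:-1], counts.indptr[1:]):
    row_data = counts.data[start:stop]
    if row_data.size:                # skip empty rows to avoid 0/0
        row_data /= row_data.sum()   # in-place on the underlying data array

print(counts.todense())
```

Both variants mutate counts.data directly, which is why no new matrix is allocated.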
@MSeifert's answer is much more efficient, and that should be the proper way of doing it. I didn't realize that writing counts[i, :] implies some column slicing is done; the documentation explicitly says those are really inefficient operations on a csr_matrix, and this is indeed a good example of that.
The docs say row slicing is efficient, so I think you should do:
for i in range(counts.shape[0]):
    counts[i, :] /= counts[i, :].sum()
This way you edit your matrix in place, it stays sparse and you don't have to use vstack. I'm not sure it is the most efficient operation, but at least you should not have memory issues, and there is no slow down effect as the rows are being computed:
import time

s = time.time()
for i in range(counts.shape[0]):
    counts[i, :] /= (counts[i, :].sum() + 1)  # +1 presumably guards against empty rows
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()
1000 6.00199794769
2000 6.02894091606
3000 7.44459486008
4000 7.10011601448
5000 6.16998195648
6000 7.79510307312
7000 7.00139117241
8000 7.37821507454
9000 7.28075814247
...
row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
                                                 shape=(counts.shape[0], 2),
                                                 strides=2*counts.indptr.strides)
for i, (start, stop) in enumerate(row_start_stop):
    row = counts.data[start:stop]
    row /= row.sum()
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()
1000 0.00735783576965
2000 0.0108380317688
3000 0.0102109909058
4000 0.0131571292877
5000 0.00670218467712
6000 0.00608897209167
7000 0.00663685798645
8000 0.0164499282837
9000 0.0061981678009
...
As to why using vstack is slow, @MSeifert's answer is great.