Large NumPy/SciPy CSR matrix, row-wise operation

I want to iterate over the rows of a CSR matrix and divide each element by the row's sum, similar to this post:

numpy divide row by row sum

My problem is that I'm dealing with a large matrix of shape (96582, 350138).

When I apply the operation from the linked post, it blows up my memory, because the returned matrix is dense.
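For reference, the linked approach looks roughly like this (a sketch; the exact code depends on the post, but the effect is the same): dividing the sparse matrix by its dense column of row sums densifies the result, which at this shape needs about 270 GB as float64.

probs = counts / counts.sum(axis=1)  # returns a dense matrix of shape (96582, 350138)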

So here is my first try:

for row in counts:
    row = row / row.sum()

Unfortunately this doesn't affect the matrix at all (iterating over a sparse matrix yields new single-row matrices, so rebinding row leaves counts untouched), so my second idea was to create a new CSR matrix and concatenate the rows using vstack:

from scipy import sparse
import time

start_time = curr_time = time.time()
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))

This works, but after a few thousand iterations it gets slower and slower:

step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97

Any suggestions? Any idea why vstack gets slower and slower?

asked Jul 28 '17 by nadre

2 Answers

vstack is an O(n) operation because it needs to allocate memory for the result and then copy the contents of all arrays you passed as arguments into it. Calling it once per row therefore re-copies everything accumulated so far on every iteration, which makes the loop quadratic overall and explains the growing delta times.
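One way to avoid the quadratic blow-up is to collect the normalized rows in a plain list and stack them once at the end, so each row is copied only one time (a sketch, assuming no row sums to zero):

from scipy import sparse

rows = [row / row.sum() for row in counts]  # each row stays sparse
mtx = sparse.vstack(rows)                   # a single O(total size) copy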

But you can simply use multiply to do the whole operation at once:

>>> res = counts.multiply(1 / counts.sum(1))  # multiply with inverse
>>> res.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
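One caveat worth adding (an assumption on my part, not covered in the answer): any all-zero row makes 1 / counts.sum(1) infinite. A guard via a diagonal scaling matrix keeps everything sparse and leaves empty rows untouched:

import numpy as np
from scipy import sparse

row_sums = np.asarray(counts.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0                   # leave all-zero rows as they are
res = sparse.diags(1.0 / row_sums).dot(counts)  # row-scale; result stays sparse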

But it's also quite easy to use np.lib.stride_tricks.as_strided to perform the operation you want (relatively performantly). The as_strided function also lets you do more complicated operations on the array (if there is no method or function for your case).

For example, using the example CSR matrix from the scipy documentation:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> row = np.array([0, 0, 1, 2, 2, 2])
>>> col = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1., 2, 3, 4, 5, 6])
>>> counts = csr_matrix((data, (row, col)), shape=(3, 3))
>>> counts.todense()
matrix([[ 1.,  0.,  2.],
        [ 0.,  0.,  3.],
        [ 4.,  5.,  6.]])

You can divide each row by its sum like this:

>>> # view indptr as overlapping (start, stop) pairs, one per row
>>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
...                                                  shape=(counts.shape[0], 2),
...                                                  strides=2*counts.indptr.strides)
>>> for start, stop in row_start_stop:
...     row = counts.data[start:stop]  # nonzero entries of this row
...     row /= row.sum()               # in-place: modifies counts.data directly
>>> counts.todense()
matrix([[ 0.33333333,  0.        ,  0.66666667],
        [ 0.        ,  0.        ,  1.        ],
        [ 0.26666667,  0.33333333,  0.4       ]])
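If the as_strided view feels too clever, the same in-place loop can be written directly against indptr, since indptr[i] and indptr[i + 1] delimit row i's entries in counts.data (a sketch, with a guard for empty rows added):

for i in range(counts.shape[0]):
    start, stop = counts.indptr[i], counts.indptr[i + 1]
    row = counts.data[start:stop]   # nonzero entries of row i
    if row.size:                    # skip empty rows to avoid 0/0
        row /= row.sum()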
answered Nov 07 '22 by MSeifert


Edit

@MSeifert's answer is much more efficient, and that should be the proper way of doing it. I didn't realize that writing counts[i, :] implies some column slicing, and the documentation explicitly says those are really inefficient operations on a csr_matrix. We indeed have a good example of that here.

Answer

The docs say row slicing is efficient, so I think you should do:

for i in range(counts.shape[0]):
    counts[i,:] /= counts[i,:].sum()

This way you edit your matrix in place, it stays sparse, and you don't have to use vstack. I'm not sure it's the most efficient operation, but at least you shouldn't run into memory issues, and there is no slow-down effect as the rows are computed:

import time

s = time.time()
for i in range(counts.shape[0]):
    counts[i, :] /= (counts[i, :].sum() + 1)  # the +1 presumably guards against all-zero rows
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()

Performances:

1000 6.00199794769
2000 6.02894091606
3000 7.44459486008
4000 7.10011601448
5000 6.16998195648
6000 7.79510307312
7000 7.00139117241
8000 7.37821507454
9000 7.28075814247
...

Performances for MSeifert's answer:

row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr, shape=(counts.shape[0], 2),
                                                 strides=2*counts.indptr.strides)
for i, (start, stop) in enumerate(row_start_stop):
    row = counts.data[start:stop]
    row /= row.sum()
    if i % 1000 == 0:
        e = time.time()
        if i > 0:
            print(i, e - s)
        s = time.time()

1000 0.00735783576965
2000 0.0108380317688
3000 0.0102109909058
4000 0.0131571292877
5000 0.00670218467712
6000 0.00608897209167
7000 0.00663685798645
8000 0.0164499282837
9000 0.0061981678009
...

As to why using vstack is slow, @MSeifert's answer is great.
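As an aside (not part of the original answers, and assuming scikit-learn is available): for a nonnegative counts matrix the whole normalization is also a one-liner that keeps the result sparse:

from sklearn.preprocessing import normalize

# L1-normalize each row; for nonnegative data this divides every row by its sum
probs = normalize(counts, norm='l1', axis=1)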

answered Nov 07 '22 by Anis