
Scipy sparse matrix alternative for getrow()

I am working with large sparse binary matrices, which I have condensed using the Scipy sparse matrix implementation. The Jaccard distance from scipy.spatial.distance does not support direct operation on sparse matrices, so I can either:

  1. Convert the entire sparse matrix to dense and then operate on each row as a vector, which is memory hungry

    or

  2. Loop through the sparse matrix, grab each row using getrow(), and operate on it.

    or

  3. Write my own implementation to work on sparse matrices.

To put things in perspective, here is the sample code:

import scipy.spatial.distance as d
import numpy as np
from scipy.sparse import csr_matrix

# benchmark performance
X = np.random.random((3000, 3000))
# binarize: entries above 0.3 become 0, the rest become 1
X[X > 0.3] = 0
X[X > 0] = 1
mat = csr_matrix(X)

# a sparse test vector with three nonzero entries
a = np.zeros(3000)
a[4] = a[100] = a[22] = 1
a = csr_matrix(a)

def jaccard_fast(v1, v2):
    common = v1.dot(v2.T)       # 1x1 matrix: count of positions that are 1 in both vectors
    dis = (v1 != v2).getnnz()   # count of positions where the vectors differ
    if common[0, 0]:
        return 1.0 - float(common[0, 0]) / float(common[0, 0] + dis)
    else:
        return 0.0
    
def benchmark_jaccard_fast():
    for i in range(mat.shape[0]):
        jaccard_fast(mat.getrow(i),a)
        
def benchmark_jaccard_internal_todense():
    for v1,v2 in zip(mat.todense(),a.todense()):
         d.jaccard(v1,v2)
        
def benchmark_jaccard_internal_getrow():
    for i in range(mat.shape[0]):
        d.jaccard(mat.getrow(i),a)
        

print "Jaccard Fast:"
%time benchmark_jaccard_fast()
print "Jaccard Scipy (expanding to dense):"
%time benchmark_jaccard_internal_todense()
print "Jaccard Scipy (using getrow):"
%time benchmark_jaccard_internal_getrow()

where jaccard_fast is my own implementation. My implementation appears to be faster than the internal one on scipy sparse matrices; however, getrow() seems to slow it down. Benchmarking jaccard_fast against scipy.spatial.distance.jaccard gives:

Jaccard Fast:
CPU times: user 1.28 s, sys: 0 ns, total: 1.28 s
Wall time: 1.28 s
Jaccard Scipy (expanding to dense):
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 37.2 ms
Jaccard Scipy (using getrow):
CPU times: user 1.82 s, sys: 0 ns, total: 1.82 s
Wall time: 1.81 s

Any help on how to avoid the getrow bottleneck would be appreciated. I cannot afford to expand my sparse matrix using todense() due to memory limitations.

asked Oct 19 '22 by Segmented

1 Answer

Sparse indexing is known to be slower; see, for example, How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

In [33]: timeit for row in mat: x=row  # sparse iteration
1 loops, best of 3: 510 ms per loop

In [35]: timeit for row in mat.todense(): x=row  # dense iteration
10 loops, best of 3: 175 ms per loop

But I find that your d.jaccard is also slower when it is given sparse matrices:

In [36]: ad=a.todense()

In [37]: timeit for row in mat.todense(): d.jaccard(row,ad) # all dense
1 loops, best of 3: 734 ms per loop

In [38]: timeit for row in mat: d.jaccard(row.todense(),ad) # inner dense
1 loops, best of 3: 1.69 s per loop

In [39]: timeit for row in mat: d.jaccard(row,a) # all sparse
1 loops, best of 3: 4.61 s per loop

Eliminating the getrow factor

In [40]: mrow=mat.getrow(0)

In [41]: mrowd=mrow.todense()

In [42]: timeit d.jaccard(mrow, a)  # one sparse row
1000 loops, best of 3: 1.32 ms per loop

In [43]: timeit d.jaccard(mrow.todense(), a.todense())  # with conversion
1000 loops, best of 3: 539 µs per loop

In [44]: timeit d.jaccard(mrowd, ad)  #  dense
10000 loops, best of 3: 173 µs per loop

======================

I need to rerun those tests, because d.jaccard does not work with sparse inputs (and jaccard_fast does not work with dense ones). So separating the sparse row-iteration issue from the jaccard calculation will take more work.

I've reworked jaccard_fast a bit:

def my_jaccard(mat, a):
    common = mat*a.T # sparse does the large matrix product well 
    dis=np.array([(row!=a).getnnz() for row in mat]) # iterative
    cA = common.A.ravel()
    return 1 - cA/(cA + dis)

It returns values that match d.jaccard run on the dense rows. d.jaccard returns 1 for rows where common is 0, so I don't seem to need a test on cA (unless it is possible for both cA and dis to be 0 at the same slot).

In [141]: r=np.array([d.jaccard(row,ad) for row in mat.todense()])

In [142]: r1=my_jaccard(mat,a)

In [143]: np.allclose(r,r1)
Out[143]: True

but the speed is only about half that of the dense version. If I can rework the dis calculation, it should reach a similar speed.

In [144]: timeit r=np.array([d.jaccard(row,ad) for row in mat.todense()])
1 loops, best of 3: 783 ms per loop

In [145]: timeit r1=my_jaccard(mat,a)
1 loops, best of 3: 1.29 s per loop

A further tweak to the calculation: mask out the common values that are 0. This has two purposes - it ensures we don't have a divide-by-zero problem, and it reduces the number of dis iterations, giving a small speed improvement.

def my_jaccard(mat, a):
    common=mat*a.T
    cA = common.A.ravel()
    mask = cA!=0
    cA = cA[mask]
    dis = np.array([(row!=a).getnnz() for row, b in zip(mat,mask) if b])
    ret = np.ones(mat.shape[0])
    ret[mask] = 1 - cA/(cA+dis)
    return ret

With this, the time drops a bit:

In [188]: timeit my_jaccard(mat,a)
1 loops, best of 3: 1.04 s per loop

==================

There's an overlap in issues with Python - Efficient Function with scipy sparse Matrices

In that question I looked at comparing a sparse matrix with a 1-row matrix, and found that using sparse.kron to replicate the row was the fastest way to replicate numpy broadcasting.
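
To see what that kron trick does, a minimal sketch on a toy row (not one of the timed cases): kron of a 1-row sparse matrix with a column of ones simply stacks identical copies of that row.

from scipy import sparse
import numpy as np

row = sparse.csr_matrix(np.array([[0, 1, 0, 1]]))  # a 1-row sparse matrix
rep = sparse.kron(row, np.ones((3, 1), int))       # stack 3 identical copies of the row
print(rep.toarray())
# [[0 1 0 1]
#  [0 1 0 1]
#  [0 1 0 1]]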

Using that idea in jaccard to calculate the dis array:

def my_jaccard1(mat, a):
    common = mat*a.T
    cA = common.A.ravel()
    aM = sparse.kron(a,np.ones((mat.shape[0],1),int))
    dis = (mat!=aM).sum(1)
    ret = 1-cA/(cA+dis.A1)
    return ret    

With this, timings improve significantly (about 10x):

In [318]: timeit my_jaccard1(mat,a)
1 loops, best of 3: 97.1 ms per loop

I can apply masking as before to protect against divide-by-zero, but that actually slows down the calculation (to about 140 ms):

def my_jaccard3(mat, a):
    common = mat*a.T
    cA = common.A.ravel()
    mask = cA!=0
    cA = cA[mask]
    aM = sparse.kron(a,np.ones((len(cA),1),int))
    dis = (mat[mask,:]!=aM).sum(1)
    ret = np.ones(mat.shape[0])
    ret[mask] = 1 - cA/(cA+dis.A1) 
    return ret  

========================

edit - test of the suspected case

In [75]: x,y= np.array([1,1,0,0,1,0]), np.array([0,0,1,0,1,0])

In [76]: d.jaccard(x,y)
Out[76]: 0.75

In [78]: jaccard_fast(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[78]: 0.75

My versions:

In [79]: my_jaccard(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[79]: array([ 0.75])
...
In [82]: my_jaccard3(sparse.csr_matrix(x),sparse.csr_matrix(y))
Out[82]: array([ 0.75])
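
The case where common is 0 is where the implementations can differ, as noted above; a minimal sketch of that case (not one of the original tests), reusing the same functions:

x2, y2 = np.array([1, 1, 0, 0, 0, 0]), np.array([0, 0, 1, 0, 1, 0])  # no common nonzeros

d.jaccard(x2, y2)                                             # expect 1.0
jaccard_fast(sparse.csr_matrix(x2), sparse.csr_matrix(y2))    # expect 0.0 (the else branch)
my_jaccard1(sparse.csr_matrix(x2), sparse.csr_matrix(y2))     # expect 1, agreeing with d.jaccard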

(edit - explicitly use sparse.kron)

answered Oct 24 '22 by hpaulj