I have a large sparse matrix - using sparse.csr_matrix from scipy. The values are binary. For each row, I need to compute the Jaccard distance to every row in the same matrix. What's the most efficient way to do this? Even for a 10.000 x 10.000 matrix, my runtime takes minutes to finish.
Current solution:
def jaccard(a, b):
intersection = float(len(set(a) & set(b)))
union = float(len(set(a) | set(b)))
return 1.0 - (intersection/union)
def regions(csr, p, epsilon):
neighbors = []
for index in range(len(csr.indptr)-1):
if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon:
neighbors.append(index)
return neighbors
csr = scipy.sparse.csr_matrix("file")
regions(csr, 0.51) #this is called for every row
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With