
Compute Jaccard distances on sparse matrix


I have a large sparse matrix, stored as a sparse.csr_matrix from scipy. The values are binary. For each row, I need to compute the Jaccard distance to every row in the same matrix. What's the most efficient way to do this? Even for a 10,000 x 10,000 matrix, my current solution takes minutes to finish.

Current solution:

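import scipy.sparse

def jaccard(a, b):
    # a, b: column indices of the nonzero entries of two rows
    a, b = set(a), set(b)
    intersection = float(len(a & b))
    union = float(len(a | b))
    return 1.0 - (intersection / union)

def regions(csr, p, epsilon):
    # Collect every row whose Jaccard distance to p is at most epsilon
    neighbors = []
    for index in range(len(csr.indptr) - 1):
        if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index + 1]]) <= epsilon:
            neighbors.append(index)
    return neighbors

csr = scipy.sparse.csr_matrix("file")  # placeholder: the matrix is loaded from a file
p = csr.indices[csr.indptr[0]:csr.indptr[1]]  # column indices of one row, here row 0
regions(csr, p, 0.51)  # this is called for every row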
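
For comparison, one direction that avoids the per-row Python loop is to get all pairwise intersection sizes from a single sparse product, since for binary rows the entry (i, j) of csr * csr.T counts the columns that rows i and j share. The following is only a minimal sketch under a few assumptions: the matrix holds only 0/1 values with no explicitly stored zeros, epsilon is below 1 (pairs with no shared column have distance 1 and are never visited), and empty rows get no neighbors at all. The name jaccard_neighbors is hypothetical and not part of scipy.

```python
import numpy as np
import scipy.sparse as sp

def jaccard_neighbors(csr, epsilon):
    """For every row, list the rows whose Jaccard distance is <= epsilon.

    Assumes a binary matrix without explicit zeros and epsilon < 1.
    """
    binary = sp.csr_matrix(csr).astype(np.int32)
    # Number of nonzero columns in each row, i.e. the size of each set
    sizes = np.asarray(binary.sum(axis=1)).ravel()
    # inter[i, j] = number of columns shared by row i and row j
    inter = (binary @ binary.T).tocoo()
    # |A union B| = |A| + |B| - |A intersect B|, for stored pairs only
    union = sizes[inter.row] + sizes[inter.col] - inter.data
    dist = 1.0 - inter.data / union
    keep = dist <= epsilon
    neighbors = [[] for _ in range(binary.shape[0])]
    for i, j in zip(inter.row[keep], inter.col[keep]):
        neighbors[int(i)].append(int(j))
    return [sorted(n) for n in neighbors]

# Small made-up example: rows 0 and 3 are identical, row 1 overlaps them, row 2 is disjoint
example = sp.csr_matrix(np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]))
result = jaccard_neighbors(example, 0.51)
```

Whether this is actually faster depends on how dense csr * csr.T turns out to be. If many rows share at least one column, the product can need far more memory than the input matrix, so processing the rows in blocks may be necessary.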