I have 20,000 documents that I want to compute the true Jaccard similarity for, so that I can later check how accurately MinWise hashing approximates it.
Each document is represented as a column in a NumPy matrix, where each row is a word that either appears in the document (entry = 1) or does not (entry = 0). There are ~600 words (rows).
So, for example, column 1 would be [1 0 0 0 0 0 1 0 0 0 1 0], which means words 1, 7, and 11 appear in it and no others.
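To make what I'm computing concrete, here's a tiny sketch (the second column is made up purely for illustration) showing that the set view and my 0/1-column view give the same Jaccard value:

import numpy as np

# column 1 from above (words 1, 7, 11) and a hypothetical second column (words 1, 2, 11)
c1 = np.array([1,0,0,0,0,0,1,0,0,0,1,0])
c2 = np.array([1,1,0,0,0,0,0,0,0,0,1,0])

# set view: |intersection| / |union|
s1, s2 = set(np.flatnonzero(c1) + 1), set(np.flatnonzero(c2) + 1)
print(len(s1 & s2) / float(len(s1 | s2)))   # 0.5

# 0/1-column view: n11 / (nwords - n00), where nwords - n00 is the union size
n11 = int(((c1 == 1) & (c2 == 1)).sum())
n00 = int(((c1 == 0) & (c2 == 0)).sum())
print(n11 / float(len(c1) - n00))           # 0.5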
Is there a more efficient way to compute the similarity than my element-wise comparison approach? I don't see how I could use sets to improve the speed, since the sets would just become {0, 1}, but as it stands the code is impossibly slow.
import numpy as np

# load file into python
rawdata = np.loadtxt("myfile.csv", delimiter="\t")
# convert the documents from rows to columns
rawdata = np.transpose(rawdata)
# compute true jaccard similarity
ndocs = rawdata.shape[1]
nwords = rawdata.shape[0]
nfeats = nwords   # number of features = number of words
tru_sim = np.zeros((ndocs, ndocs))

# computes jaccard similarity of 2 documents
def jaccard(c1, c2):
    n11 = sum((c1 == 1) & (c2 == 1))
    n00 = sum((c1 == 0) & (c2 == 0))
    jac = n11 / (nfeats - n00)   # nfeats - n00 = number of words present in at least one doc (the union)
    return jac

for i in range(0, ndocs):
    tru_sim[i, i] = 1
    for j in range(i + 1, ndocs):
        tru_sim[i, j] = jaccard(rawdata[:, i], rawdata[:, j])
Here's a vectorized approach -
# Get the row, col indices that are to be set in output array
r,c = np.tril_indices(ndocs,-1)
# Use those indices to slice out respective columns
p1 = rawdata[:,c]
p2 = rawdata[:,r]
# Perform n11 and n00 vectorized computations across all indexed columns
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)
# Finally, set up the output array and perform the final division
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)
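At the full scale from the question (20,000 documents), the sliced p1/p2 arrays hold one column per document pair and can get very large. If memory becomes a concern, here is a sketch of an equivalent matrix-multiplication formulation that computes all pairwise counts at once without materializing the per-pair slices, assuming rawdata and nwords are as defined in the question and rawdata holds only 0s and 1s (the intermediate matrices are each ndocs x ndocs, which the output requires anyway):

# Sketch, assuming rawdata is an nwords x ndocs array of 0s and 1s
A = rawdata.astype(np.float64)
n11 = A.T.dot(A)                   # n11[i,j]: words present in both doc i and doc j
n00 = (1 - A).T.dot(1 - A)         # n00[i,j]: words absent from both docs
full_sim = n11 / (nwords - n00)    # nwords - n00 is the union size for each pair
                                   # (an all-zero document would give 0/0 here)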
Alternative way to compute n11v and n00v with np.einsum -
n11v = np.einsum('ij,ij->j',(p1==1),(p2==1).astype(int))
n00v = np.einsum('ij,ij->j',(p1==0),(p2==0).astype(int))
If rawdata consists of 0s and 1s only, a simpler way to get them would be -
n11v = np.einsum('ij,ij->j',p1,p2)
n00v = np.einsum('ij,ij->j',1-p1,1-p2)
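A quick sanity check of the einsum versions against the boolean-sum versions, assuming p1 and p2 are the sliced column blocks from above -

# both pairs of expressions should produce identical count arrays
assert np.array_equal(np.einsum('ij,ij->j', p1, p2),
                      ((p1 == 1) & (p2 == 1)).sum(0))
assert np.array_equal(np.einsum('ij,ij->j', 1 - p1, 1 - p2),
                      ((p1 == 0) & (p2 == 0)).sum(0))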
Benchmarking
Function definitions -
def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs,ndocs))
    for i in range(0,ndocs):
        tru_sim[i,i]=1
        for j in range(i+1,ndocs):
            tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
    return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
    r,c = np.tril_indices(ndocs,-1)
    p1 = rawdata[:,c]
    p2 = rawdata[:,r]
    n11v = ((p1==1) & (p2==1)).sum(0)
    n00v = ((p1==0) & (p2==0)).sum(0)
    out = np.eye(ndocs)
    out[c,r] = n11v / (nfeats-n00v)
    return out
Verification and timings -
In [6]: # Setup inputs
...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
...: rawdata = np.transpose(rawdata)
...: ndocs = rawdata.shape[1]
...: nwords = rawdata.shape[0]
...: nfeats = nwords  # number of features = number of words
...:
In [7]: # Verify results
...: out1 = original_app(rawdata, ndocs, nfeats)
...: out2 = vectorized_app(rawdata, ndocs, nfeats)
...: print np.allclose(out1,out2)
...:
True
In [8]: %timeit original_app(rawdata, ndocs, nfeats)
1 loops, best of 3: 8.72 s per loop
In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
10 loops, best of 3: 27.6 ms per loop
Some magical 300x+ speedup there!
So, why is it so fast? There are many factors involved, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach harnesses that quite well, hence the speedup.
Here's one related Q&A that talks about these performance criteria in detail.
To compute Jaccard, use:
def jaccard(x, y):
    x = np.asarray(x, bool)  # not necessary if you keep your data
    y = np.asarray(y, bool)  # in a boolean array already!
    return np.double(np.bitwise_and(x, y).sum()) / np.double(np.bitwise_or(x, y).sum())

print(jaccard([1,1,0,0,0],[0,1,0,0,1]))
# 0.33333333333333331
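If SciPy is available, the full pairwise similarity matrix can also be cross-checked against scipy.spatial.distance.pdist, which returns the Jaccard dissimilarity for boolean rows (so subtract it from 1). A sketch, assuming rawdata is the nwords x ndocs 0/1 matrix from the question:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# pdist expects one observation per row, so pass documents as rows
docs = rawdata.T.astype(bool)
sim = 1.0 - squareform(pdist(docs, metric='jaccard'))  # ndocs x ndocs, diagonal = 1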