I have 20,000 documents that I want to compute the true Jaccard similarity for, so that I can later check how accurately MinWise hashing approximates it.
Each document is represented as a column in a NumPy matrix, where each row is a word that either appears in the document (entry = 1) or does not (entry = 0). There are ~600 words (rows).
So, for example, column 1 would be [1 0 0 0 0 0 1 0 0 0 1 0], which means words 1, 7, and 11 appear in it and no others.
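To make what I'm computing concrete, here's a tiny sketch (the second column is made up purely for illustration) showing that the set view and my 0/1-column view give the same Jaccard value:

import numpy as np

# column 1 from above (words 1, 7, 11) and a hypothetical second column (words 1, 2, 11)
c1 = np.array([1,0,0,0,0,0,1,0,0,0,1,0])
c2 = np.array([1,1,0,0,0,0,0,0,0,0,1,0])

# set view: |intersection| / |union|
s1, s2 = set(np.flatnonzero(c1) + 1), set(np.flatnonzero(c2) + 1)
print(len(s1 & s2) / float(len(s1 | s2)))   # 0.5

# 0/1-column view: n11 / (nwords - n00), where nwords - n00 is the union size
n11 = int(((c1 == 1) & (c2 == 1)).sum())
n00 = int(((c1 == 0) & (c2 == 0)).sum())
print(n11 / float(len(c1) - n00))           # 0.5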
Is there a more efficient way to compute the similarity than my element-wise comparison approach? I don't see how I could use sets to improve the speed, since the sets would just become {0, 1}, but as it stands the code is impossibly slow.
import numpy as np

# load file into python
rawdata = np.loadtxt("myfile.csv", delimiter="\t")
# convert the documents from rows to columns
rawdata = np.transpose(rawdata)
# compute true jaccard similarity
ndocs = rawdata.shape[1]
nwords = rawdata.shape[0]
nfeats = nwords   # number of features = number of words
tru_sim = np.zeros((ndocs, ndocs))

# computes jaccard similarity of 2 documents
def jaccard(c1, c2):
    n11 = sum((c1 == 1) & (c2 == 1))
    n00 = sum((c1 == 0) & (c2 == 0))
    jac = n11 / (nfeats - n00)   # nfeats - n00 = number of words present in at least one doc (the union)
    return jac

for i in range(0, ndocs):
    tru_sim[i, i] = 1
    for j in range(i + 1, ndocs):
        tru_sim[i, j] = jaccard(rawdata[:, i], rawdata[:, j])
Here's a vectorized approach -
# Get the row, col indices that are to be set in output array
r,c = np.tril_indices(ndocs,-1)
# Use those indices to slice out respective columns
p1 = rawdata[:,c]
p2 = rawdata[:,r]
# Perform n11 and n00 vectorized computations across all indexed columns
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)
# Finally, set up the output array and perform the final division
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)
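At the full scale from the question (20,000 documents), the sliced p1/p2 arrays hold one column per document pair and can get very large. If memory becomes a concern, here is a sketch of an equivalent matrix-multiplication formulation that computes all pairwise counts at once without materializing the per-pair slices, assuming rawdata and nwords are as defined in the question and rawdata holds only 0s and 1s (the intermediate matrices are each ndocs x ndocs, which the output requires anyway):

# Sketch, assuming rawdata is an nwords x ndocs array of 0s and 1s
A = rawdata.astype(np.float64)
n11 = A.T.dot(A)                   # n11[i,j]: words present in both doc i and doc j
n00 = (1 - A).T.dot(1 - A)         # n00[i,j]: words absent from both docs
full_sim = n11 / (nwords - n00)    # nwords - n00 is the union size for each pair
                                   # (an all-zero document would give 0/0 here)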
Alternative way to compute n11v and n00v with np.einsum -
n11v = np.einsum('ij,ij->j',(p1==1),(p2==1).astype(int))
n00v = np.einsum('ij,ij->j',(p1==0),(p2==0).astype(int))
If rawdata consists of 0s and 1s only, a simpler way to get them would be -
n11v = np.einsum('ij,ij->j',p1,p2)
n00v = np.einsum('ij,ij->j',1-p1,1-p2)
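A quick sanity check of the einsum versions against the boolean-sum versions, assuming p1 and p2 are the sliced column blocks from above -

# both pairs of expressions should produce identical count arrays
assert np.array_equal(np.einsum('ij,ij->j', p1, p2),
                      ((p1 == 1) & (p2 == 1)).sum(0))
assert np.array_equal(np.einsum('ij,ij->j', 1 - p1, 1 - p2),
                      ((p1 == 0) & (p2 == 0)).sum(0))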
Benchmarking
Function definitions -
def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs,ndocs))
    for i in range(0,ndocs):
        tru_sim[i,i]=1
        for j in range(i+1,ndocs):
            tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
    return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
    r,c = np.tril_indices(ndocs,-1)
    p1 = rawdata[:,c]
    p2 = rawdata[:,r]
    n11v = ((p1==1) & (p2==1)).sum(0)
    n00v = ((p1==0) & (p2==0)).sum(0)
    out = np.eye(ndocs)
    out[c,r] = n11v / (nfeats-n00v)
    return out
Verification and timings -
In [6]: # Setup inputs
...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
...: rawdata = np.transpose(rawdata)
...: ndocs = rawdata.shape[1]
...: nwords = rawdata.shape[0]
...: nfeats = nwords  # number of features = number of words
...:
In [7]: # Verify results
...: out1 = original_app(rawdata, ndocs, nfeats)
...: out2 = vectorized_app(rawdata, ndocs, nfeats)
...: print np.allclose(out1,out2)
...:
True
In [8]: %timeit original_app(rawdata, ndocs, nfeats)
1 loops, best of 3: 8.72 s per loop
In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
10 loops, best of 3: 27.6 ms per loop
Some magical 300x+ speedup there!
So, why is it so fast? There are many factors involved, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach harnesses that quite well, hence the speedup.
Here's one related Q&A that talks about these performance criteria in detail.
To compute Jaccard, use:
def jaccard(x, y):
    x = np.asarray(x, bool)  # not necessary if you keep your data
    y = np.asarray(y, bool)  # in a boolean array already!
    return np.double(np.bitwise_and(x, y).sum()) / np.double(np.bitwise_or(x, y).sum())

print(jaccard([1,1,0,0,0],[0,1,0,0,1]))
# 0.33333333333333331
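If SciPy is available, the full pairwise similarity matrix can also be cross-checked against scipy.spatial.distance.pdist, which returns the Jaccard dissimilarity for boolean rows (so subtract it from 1). A sketch, assuming rawdata is the nwords x ndocs 0/1 matrix from the question:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# pdist expects one observation per row, so pass documents as rows
docs = rawdata.T.astype(bool)
sim = 1.0 - squareform(pdist(docs, metric='jaccard'))  # ndocs x ndocs, diagonal = 1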