
Vectorized cosine similarity calculation in Python

I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find the top 5 vectors in B that best match the given A vector. For now I'm looping over all of A, and looping over all of B, calculating the cosine similarities one-by-one with SciPy's spatial.distance.cosine(a, b). Is there a faster way to do this? Perhaps with matrices?

asked Dec 03 '15 by BoltzmannBrain


2 Answers

You can first transform each vector into its unit vector (by dividing it by its length). The distance formula then simplifies to

 d = 1 - e_v · e_w

 with e_v = v / ||v||_2 , e_w = w / ||w||_2 

which is faster to compute.
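A minimal sketch of this identity for a single pair of vectors (the vectors here are just illustrative), checked against SciPy's implementation:

```python
import numpy as np
from scipy.spatial.distance import cosine

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

# Normalize to unit vectors; the cosine distance is then 1 minus their dot product.
e_v = v / np.linalg.norm(v)
e_w = w / np.linalg.norm(w)
d = 1.0 - np.dot(e_v, e_w)

# Same value as scipy.spatial.distance.cosine(v, w).
```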

Probably even faster is to use scipy.spatial.distance.cdist(XA, XB, 'cosine'). You first need to build a matrix from each set of vectors, one row per vector (pseudo-code):

XA = np.array([vecA1, vecA2, ..., vecAn])
XB = np.array([vecB1, vecB2, ..., vecBm])
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
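Once you have the full distance matrix, the top-5 matches per A vector that the question asks for can be read off row by row. A sketch with random stand-in data (the set sizes 50 and 80 are made up; only the vector length 400 comes from the question):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
XA = rng.standard_normal((50, 400))   # stand-in for set A: 50 vectors of length 400
XB = rng.standard_normal((80, 400))   # stand-in for set B: 80 vectors of length 400

# Pairwise cosine distances, shape (50, 80).
distances = cdist(XA, XB, 'cosine')

# For each vector in A, the indices of the 5 closest vectors in B
# (smallest cosine distance = best match).
top5 = np.argsort(distances, axis=1)[:, :5]
```

For large B, np.argpartition(distances, 5, axis=1)[:, :5] finds the same 5 indices (unordered) without sorting every row in full.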
answered Oct 31 '22 by jofel

This is a NAIVE no-loop, no-overhead(?) implementation of what you need...

import numpy as np
from numpy.linalg import norm

res = 1 - np.dot(A/norm(A, axis=1)[..., None], (B/norm(B, axis=1)[..., None]).T)

Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?


P.S. axis=1 above assumes that your vectors are stored row-wise, one vector per row:

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]
#  ...]


Comments

In [79]: A = np.random.random((2,5))

In [80]: A
Out[80]: 
array([[ 0.2917865 ,  0.89617367,  0.27118045,  0.58596817,  0.05154168],
       [ 0.61131638,  0.2859271 ,  0.09411264,  0.57995386,  0.09829525]])

In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988,  0.90018201])

In [82]: norm(A,axis=1)[...,None]
Out[82]: 
array([[ 1.14359988],
       [ 0.90018201]])

In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)

ValueError: operands could not be broadcast together with shapes (2,5) (2,) 

In [84]: A/norm(A,axis=1)[...,None]
Out[84]: 
array([[ 0.25514737,  0.78364267,  0.23712878,  0.51238915,  0.04506968],
       [ 0.67910309,  0.31763254,  0.10454846,  0.64426289,  0.10919486]])

In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1.,  1.])


The session above explains the normalisation procedure. Once we have the normalised matrices A' and B', we take their dot product (transposing the B' matrix, of course); the result is a matrix whose element i, j is the dot product of the NORMALISED vectors A_i and B_j. Subtracting this matrix from 1 gives the matrix of cosine distances. Or so I hope...

Test & Benchmark

In [1]: import numpy as np                                              

In [2]: from numpy.linalg import norm as n

In [3]: from scipy.spatial.distance import cosine

In [4]: A = np.random.random((100,400))

In [5]: B = np.random.random((100,400))

In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])

In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)

In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16

In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16

In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop

In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop

answered Oct 31 '22 by gboffi