I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find the top 5 vectors in B that best match the given A vector. For now I'm looping over all of A, and looping over all of B, calculating the cosine similarities one-by-one with SciPy's spatial.distance.cosine(a, b). Is there a faster way to do this? Perhaps with matrices?
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. We compute it with the formula

Similarity = (A.B) / (||A|| * ||B||)

where A and B are vectors, A.B is the dot product of A and B (the sum of the element-wise products), and ||A|| is the L2 norm of A (the square root of the sum of the squares of its elements).
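As a quick sanity check of that formula, here is a minimal NumPy sketch (the vectors a and b are made-up examples) that computes the similarity exactly as defined above; note that SciPy's spatial.distance.cosine returns the distance, i.e. 1 minus this value:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([5.0, 6.0, 7.0, 8.0])

dot = np.sum(a * b)                    # A.B: sum of element-wise products
norm_a = np.sqrt(np.sum(a ** 2))       # ||A||: L2 norm of A
norm_b = np.sqrt(np.sum(b ** 2))       # ||B||: L2 norm of B

similarity = dot / (norm_a * norm_b)   # cosine similarity
distance = 1.0 - similarity            # what scipy.spatial.distance.cosine returns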
You can first transform each vector into its unit vector (by dividing it by its length). Then the distance formula simplifies to

d = 1 - e_v * e_w

with e_v = v / ||v||_2 and e_w = w / ||w||_2, which is faster to calculate.
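A minimal sketch of that simplification, assuming v and w are plain 1-D NumPy arrays, which you can check against SciPy's result:

import numpy as np
from scipy.spatial.distance import cosine

v = np.random.uniform(-10, 10, 400)
w = np.random.uniform(-10, 10, 400)

e_v = v / np.linalg.norm(v)           # unit vector of v
e_w = w / np.linalg.norm(w)           # unit vector of w

d = 1 - np.dot(e_v, e_w)              # cosine distance via the simplified formula
print(np.isclose(d, cosine(v, w)))    # True, up to floating-point error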
Probably even faster is to use scipy.spatial.distance.cdist(XA, XB, 'cosine').
You need to build a matrix from each set of vectors (pseudo-code):

XA = np.array([vecA1, vecA2, ..., vecAn])   # one row per vector in A
XB = np.array([vecB1, vecB2, ..., vecBm])   # one row per vector in B
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
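To go all the way to the top-5 matches the question asks for, here's a hedged sketch (the array shapes are just placeholders): compute the full distance matrix with cdist, then take the 5 smallest distances per row with np.argsort:

import numpy as np
from scipy.spatial.distance import cdist

XA = np.random.uniform(-10, 10, (1000, 400))   # each row is one vector from A
XB = np.random.uniform(-10, 10, (5000, 400))   # each row is one vector from B

distances = cdist(XA, XB, 'cosine')            # shape (1000, 5000)

# smallest cosine distance = highest cosine similarity
top5 = np.argsort(distances, axis=1)[:, :5]    # indices into XB, best match first
print(top5.shape)                              # (1000, 5)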
This is a naive, no-loop, no-overhead(?) implementation of what you need...

from numpy.linalg import norm
res = 1 - np.dot(A/norm(A, axis=1)[..., None], (B/norm(B, axis=1)[..., None]).T)
Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?
P.S.: axis=1 above is based on the assumption that your vectors are stored row-wise, i.e.

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]
# etc.
In [79]: A = np.random.random((2,5))
In [80]: A
Out[80]:
array([[ 0.2917865 , 0.89617367, 0.27118045, 0.58596817, 0.05154168],
[ 0.61131638, 0.2859271 , 0.09411264, 0.57995386, 0.09829525]])
In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988, 0.90018201])
In [82]: norm(A,axis=1)[...,None]
Out[82]:
array([[ 1.14359988],
[ 0.90018201]])
In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)
ValueError: operands could not be broadcast together with shapes (2,5) (2,)
In [84]: A/norm(A,axis=1)[...,None]
Out[84]:
array([[ 0.25514737, 0.78364267, 0.23712878, 0.51238915, 0.04506968],
[ 0.67910309, 0.31763254, 0.10454846, 0.64426289, 0.10919486]])
In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1., 1.])
The session above explains the normalisation procedure. Once we have the normalised matrices A' and B', we take their dot product (we have to transpose the B' matrix, of course), and the result is a matrix whose element (i, j) is the dot product of the NORMALISED vectors A_i and B_j. We subtract this matrix from 1 and we have a matrix of cosine distances. Or so I hope...
In [1]: import numpy as np
In [2]: from numpy.linalg import norm as n
In [3]: from scipy.spatial.distance import cosine
In [4]: A = np.random.random((100,400))
In [5]: B = np.random.random((100,400))
In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])
In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16
In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16
In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop
In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop
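Since the vectorised result agrees with scipy's within floating-point error, the top-5 lookup from the question is just one more step on the distance matrix. A sketch, assuming res is the distance matrix from the one-liner above (np.argpartition avoids sorting each full row):

import numpy as np

# res is the (len(A), len(B)) cosine-distance matrix from the expression above
idx = np.argpartition(res, 5, axis=1)[:, :5]                       # 5 smallest per row, in no particular order
order = np.argsort(np.take_along_axis(res, idx, axis=1), axis=1)   # sort those 5 by distance
top5 = np.take_along_axis(idx, order, axis=1)                      # indices into B, best match first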