I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find the top 5 vectors in B that best match the given A vector. For now I'm looping over all of A, and looping over all of B, calculating the cosine similarities one-by-one with SciPy's spatial.distance.cosine(a, b). Is there a faster way to do this? Perhaps with matrices?
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. We compute it with the formula

Similarity = (A.B) / (||A|| * ||B||)

where A and B are vectors, A.B is the dot product of A and B (the sum of the element-wise products), and ||A|| is the L2 norm of A (the square root of the sum of the squares of its elements).
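As a quick sanity check of that formula, here is a minimal NumPy sketch (the vectors a and b are made-up examples) that computes the similarity exactly as defined above; note that SciPy's spatial.distance.cosine returns the distance, i.e. 1 minus this value:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([5.0, 6.0, 7.0, 8.0])

dot = np.sum(a * b)                    # A.B: sum of element-wise products
norm_a = np.sqrt(np.sum(a ** 2))       # ||A||: L2 norm of A
norm_b = np.sqrt(np.sum(b ** 2))       # ||B||: L2 norm of B

similarity = dot / (norm_a * norm_b)   # cosine similarity
distance = 1.0 - similarity            # what scipy.spatial.distance.cosine returns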
You can first transform each vector into its unit vector (by dividing it by its length). Then the distance formula simplifies to

d = 1 - e_v * e_w

with e_v = v / ||v||_2 and e_w = w / ||w||_2, which is faster to calculate.
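A minimal sketch of that simplification, assuming v and w are plain 1-D NumPy arrays, which you can check against SciPy's result:

import numpy as np
from scipy.spatial.distance import cosine

v = np.random.uniform(-10, 10, 400)
w = np.random.uniform(-10, 10, 400)

e_v = v / np.linalg.norm(v)           # unit vector of v
e_w = w / np.linalg.norm(w)           # unit vector of w

d = 1 - np.dot(e_v, e_w)              # cosine distance via the simplified formula
print(np.isclose(d, cosine(v, w)))    # True, up to floating-point error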
Probably even faster is to use scipy.spatial.distance.cdist(XA, XB, 'cosine').
You need to build a matrix from each set of vectors (pseudo-code):

XA = np.array([vecA1, vecA2, ..., vecAn])   # one row per vector in A
XB = np.array([vecB1, vecB2, ..., vecBm])   # one row per vector in B
distances = scipy.spatial.distance.cdist(XA, XB, 'cosine')
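To go all the way to the top-5 matches the question asks for, here's a hedged sketch (the array shapes are just placeholders): compute the full distance matrix with cdist, then take the 5 smallest distances per row with np.argsort:

import numpy as np
from scipy.spatial.distance import cdist

XA = np.random.uniform(-10, 10, (1000, 400))   # each row is one vector from A
XB = np.random.uniform(-10, 10, (5000, 400))   # each row is one vector from B

distances = cdist(XA, XB, 'cosine')            # shape (1000, 5000)

# smallest cosine distance = highest cosine similarity
top5 = np.argsort(distances, axis=1)[:, :5]    # indices into XB, best match first
print(top5.shape)                              # (1000, 5)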
This is a naive, no-loop, no-overhead(?) implementation of what you need...

from numpy.linalg import norm
res = 1 - np.dot(A/norm(A, axis=1)[..., None], (B/norm(B, axis=1)[..., None]).T)
Could you please benchmark it on a subset of your data and let us know if it's faster than scipy's cosine distance?
P.S.: axis=1 above is based on the assumption that your vectors are stored row-wise, i.e.

print A
# [[1 2 3 4 5 6 7 8 ... 400]
#  [2 3 4 5 6 7 8 9 ... 401]
# etc.
In [79]: A = np.random.random((2,5))
In [80]: A
Out[80]:
array([[ 0.2917865 , 0.89617367, 0.27118045, 0.58596817, 0.05154168],
[ 0.61131638, 0.2859271 , 0.09411264, 0.57995386, 0.09829525]])
In [81]: norm(A,axis=1)
Out[81]: array([ 1.14359988, 0.90018201])
In [82]: norm(A,axis=1)[...,None]
Out[82]:
array([[ 1.14359988],
[ 0.90018201]])
In [83]: A/norm(A,axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-83-707fa10dc673> in <module>()
----> 1 A/norm(A,axis=1)
ValueError: operands could not be broadcast together with shapes (2,5) (2,)
In [84]: A/norm(A,axis=1)[...,None]
Out[84]:
array([[ 0.25514737, 0.78364267, 0.23712878, 0.51238915, 0.04506968],
[ 0.67910309, 0.31763254, 0.10454846, 0.64426289, 0.10919486]])
In [85]: norm(A/norm(A,axis=1)[...,None], axis=1)
Out[85]: array([ 1., 1.])
The session above explains the normalisation procedure. Once we have the normalised matrices A' and B', we take their dot product (we have to transpose the B' matrix, of course), and the result is a matrix whose element (i, j) is the dot product of the NORMALISED vectors A_i and B_j. We subtract this matrix from 1 and we have a matrix of cosine distances. Or so I hope...
In [1]: import numpy as np
In [2]: from numpy.linalg import norm as n
In [3]: from scipy.spatial.distance import cosine
In [4]: A = np.random.random((100,400))
In [5]: B = np.random.random((100,400))
In [6]: C = np.array([[cosine(a,b) for b in B] for a in A])
In [7]: c = 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
In [8]: np.max(C-c)
Out[8]: 8.8817841970012523e-16
In [9]: np.min(C-c)
Out[9]: -8.8817841970012523e-16
In [10]: %timeit [[cosine(a,b) for b in B] for a in A];
1 loops, best of 3: 1.3 s per loop
In [11]: %timeit 1.0 - np.dot(A/n(A,axis=1)[:,None],(B/n(B,axis=1)[:,None]).T)
100 loops, best of 3: 9.28 ms per loop
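Since the vectorised result agrees with scipy's within floating-point error, the top-5 lookup from the question is just one more step on the distance matrix. A sketch, assuming res is the distance matrix from the one-liner above (np.argpartition avoids sorting each full row):

import numpy as np

# res is the (len(A), len(B)) cosine-distance matrix from the expression above
idx = np.argpartition(res, 5, axis=1)[:, :5]                       # 5 smallest per row, in no particular order
order = np.argsort(np.take_along_axis(res, idx, axis=1), axis=1)   # sort those 5 by distance
top5 = np.take_along_axis(idx, order, axis=1)                      # indices into B, best match first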