I have about 30,000 vectors, each with about 300 elements.
Given another vector (with the same number of elements), how can I efficiently find the most (cosine) similar vector?
The following is one implementation using a Python loop:
from time import time
import numpy as np

vectors = np.load("np_array_of_about_30000_vectors.npy")
target = np.load("single_vector.npy")
print(vectors.shape, vectors.dtype)  # (35196, 312) float32
print(target.shape, target.dtype)    # (312,) float32

start_time = time()
max_similarity = -1.0  # cosine similarity is bounded below by -1
max_index = -1
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))  # 0.466356039047 seconds
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))  # index 2399 with 0.772758982696
The following, with the Python loop removed, is 44x faster, but isn't the same computation:
print("starting max dot")
start_time = time()
print(np.max(np.dot(vectors, target)))
print("done with max dot in %s seconds" % (time() - start_time))  # 0.0105748176575 seconds
Is there a way to get this speedup from numpy doing the iterations without losing the max-index logic and the division by the product of the norms? For optimizing calculations like this, would it make sense to just do them in C?
Cosine similarity is related to Euclidean distance by ‖x − y‖² = 2(1 − cos θ), assuming x and y have been normalised to be unit vectors. So cos θ = 1 − ½‖x − y‖². Therefore, if we want to maximise cosine similarity, we can minimise Euclidean distance and then make the conversion.
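As a quick sanity check of that identity, here is a minimal sketch with made-up random data (the array names are illustrative): after normalising to unit length, the nearest vector by Euclidean distance is the same one that maximises cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 32))
target = rng.normal(size=32)

# Normalise everything to unit length.
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_target = target / np.linalg.norm(target)

cos_sim = unit_vectors @ unit_target                        # cosine similarities
dists = np.linalg.norm(unit_vectors - unit_target, axis=1)  # Euclidean distances

# For unit vectors, ||x - y||^2 = 2(1 - cos(theta)), so the closest
# vector by distance is also the most cosine-similar one.
assert np.argmax(cos_sim) == np.argmin(dists)
assert np.allclose(dists**2, 2 * (1 - cos_sim))
```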
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them: similarity = (A·B) / (‖A‖ ‖B‖), where A and B are vectors. Here A·B is the dot product of A and B, computed as the sum of the element-wise products of A and B, and ‖A‖ is the L2 norm of A, computed as the square root of the sum of the squares of its elements.
However, the Euclidean distance measure can be more informative: two pairs of points may have the same cosine similarity while their Euclidean distances differ, in which case the distance indicates that A' is closer (more similar) to B' than to C'. In other words, the cosine similarity measure can be identical while the Euclidean distance shows that points A and B are closer to each other, and hence more similar.
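A tiny example of that distinction (hand-picked points, purely illustrative): two vectors can point in exactly the same direction (cosine similarity 1) while being far apart in Euclidean distance, whereas a nearby point with a slightly different direction scores lower on cosine.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction as a, much larger magnitude
c = np.array([1.1, 0.9])    # close to a, slightly different direction

print(cosine_similarity(a, b))  # approximately 1.0
print(np.linalg.norm(a - b))    # large Euclidean distance
print(cosine_similarity(a, c))  # slightly below 1
print(np.linalg.norm(a - c))    # small Euclidean distance
```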
You have the correct idea about avoiding the loop to get performance. You can use argmin to get the minimum distance index.
Though, I would change the distance calculation to scipy's cdist as well. That way you can calculate distances to multiple targets and choose from several distance metrics, if need be.
import numpy as np
from scipy.spatial import distance

# cdist expects 2-D inputs, hence the [target] wrapper; [0] unwraps the single row
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
max_similarity = 1 - min_distance  # cosine distance = 1 - cosine similarity
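To illustrate the multiple-targets point, one cdist call can score several query vectors at once (the random data and shapes here are just for illustration):

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 64))
targets = rng.normal(size=(3, 64))  # several query vectors at once

# One call gives a (n_targets, n_vectors) matrix of cosine distances.
dist_matrix = distance.cdist(targets, vectors, "cosine")

best_indices = dist_matrix.argmin(axis=1)  # most similar vector per target
best_similarities = 1 - dist_matrix[np.arange(len(targets)), best_indices]
print(best_indices, best_similarities)
```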
HTH.
Edit: Hats off to @Deepak. cdist is the fastest if you do need the actual computed similarity value.
from scipy.spatial import distance

start_time = time()
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
max_similarity = 1 - distances[min_index]
print("done with cdist in %s seconds" % (time() - start_time))
print("Most similar vector to target is index %s with %s" % (min_index, max_similarity))
done with cdist in 0.013602018356323242 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
from time import time
import numpy as np
vectors = np.random.normal(0,100,(35196,300))
target = np.random.normal(0,100,(300))
start_time = time()
myvals = np.dot(vectors, target)  # raw dot products, no normalisation
max_index = np.argmax(myvals)
max_similarity = myvals[max_index]
print("done with max dot in %s seconds" % (time() - start_time) )
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with max dot in 0.009701013565063477 seconds
Most similar vector to target is index 12187 with 645549.917200941
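Note that the raw dot product above is not cosine similarity: it favours long rows, so its argmax can differ from the cosine argmax when the row norms vary (here it picks index 12187 instead of 11001). Normalising the rows once up front keeps the single-matmul speed while restoring the cosine result; a sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Give the rows very different norms so the bias is visible.
vectors = rng.normal(size=(1000, 64)) * rng.uniform(0.1, 10.0, size=(1000, 1))
target = rng.normal(size=64)

raw_index = np.argmax(vectors @ target)  # biased toward long rows

# Normalise each row once; a plain dot product then equals cosine similarity.
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
cos = unit_vectors @ (target / np.linalg.norm(target))
cos_index = np.argmax(cos)
print(raw_index, cos_index)
```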
max_similarity = -1.0  # cosine similarity is bounded below by -1
start_time = time()
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with loop in 0.49567198753356934 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
def my_func(candidate, target):
    return np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))

start_time = time()
out = np.apply_along_axis(my_func, 1, vectors, target)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
print("Most similar vector to target is index %s with %s" % (max_index, out[max_index]))
done with loop in 0.7495708465576172 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
start_time = time()
vnorm = np.linalg.norm(vectors, axis=1)
tnorm = np.linalg.norm(target)
out = np.matmul(vectors, target) / (vnorm * tnorm)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
print("Most similar vector to target is index %s with %s" % (max_index, out[max_index]))
done with loop in 0.04306602478027344 seconds
Most similar vector to target is index 11001 with 0.2250217098612361