How to quickly calculate cosine similarity for a large number of vectors in Python?

I have a set of 100 thousand vectors, and for each one I need to retrieve the 25 closest vectors by cosine similarity.

SciPy and scikit-learn have implementations for computing the cosine distance/similarity between two vectors, but I need to compute the cosine similarity for a 100K x 100K matrix and then take the top 25 per row. Is there a fast implementation in Python to compute that?

Following @Silmathoron's suggestion, this is what I am doing:

import heapq
import numpy

# vectors is a list of 100K vectors, each of dimension 400 (shape: 100K x 400)
vectors = numpy.array(vectors)
similarity = numpy.dot(vectors, vectors.T)

# squared magnitude of each vector (the diagonal of the dot-product matrix)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# for a zero vector, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply rows and columns by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

# 26 = the 25 nearest neighbours plus the vector itself
k = 26

# queries is defined elsewhere; assumed here to hold one string label per vector
with open("box_data.csv", "w") as box_plot_file:
    for sim, query in zip(cosine, queries):
        k_largest = heapq.nlargest(k, sim)
        result = query + "," + ",".join(map(str, k_largest)) + "\n"
        box_plot_file.write(result)
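
For scale: a dense 100K x 100K float64 matrix needs about 80 GB, so the full similarity matrix above is unlikely to fit in memory. A minimal sketch of one way around that, assuming NumPy only, is to normalise the rows once and process the dot products in blocks, keeping just the top k of each row. The function name top_k_cosine and the block_size parameter are hypothetical, chosen for illustration; at the default block_size of 256 and 100K vectors, each block of similarities is roughly 200 MB.

```python
import numpy as np


def top_k_cosine(vectors, k=25, block_size=256):
    """Return (indices, similarities) of the k most similar rows for each row.

    Hypothetical helper: the row itself is excluded from its own result.
    Assumes at least two rows.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # zero vectors stay zero instead of becoming NaN
    unit = vectors / norms   # cosine similarity is now a plain dot product

    n = unit.shape[0]
    k = min(k, n - 1)
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float64)

    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = unit[start:stop] @ unit.T  # shape: (block, n)
        # mask each row's similarity with itself
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # argpartition finds the k largest per row in linear time (unordered)
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part_sims = np.take_along_axis(sims, part, axis=1)
        # sort only those k candidates, highest similarity first
        order = np.argsort(-part_sims, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sims, order, axis=1)

    return top_idx, top_sim
```

Calling top_k_cosine(vectors, k=25) returns two arrays of shape (n, 25); row i of the first holds the indices of the 25 vectors most similar to vector i. This is still brute force (every pair is compared), it only avoids materialising the whole matrix and replaces the per-row heap with a vectorised partial sort.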


asked Jun 25 '16 14:06

silent_dev

1 Answer

I would try smarter algorithms first rather than speeding up brute force (computing all pairs of vectors). KD-trees (scipy.spatial.KDTree()) might work if your vectors are low-dimensional. If they are high-dimensional, you might need a random projection first: http://scikit-learn.org/stable/modules/random_projection.html
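
A minimal sketch of the KD-tree idea, assuming SciPy is available. KD-trees search by Euclidean distance, not cosine similarity, but for unit-length vectors the two give the same ranking because ||a - b||^2 = 2 - 2 cos(a, b). The sketch therefore normalises the vectors first and converts the distances back afterwards. The function name kdtree_top_k_cosine is hypothetical, and the code assumes there are no zero vectors and no exact duplicates (so each point's nearest neighbour is itself).

```python
import numpy as np
from scipy.spatial import KDTree


def kdtree_top_k_cosine(vectors, k=25):
    """Return (indices, cosine similarities) of the k nearest rows for each row.

    Hypothetical helper; assumes no zero vectors and no duplicate rows.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    # scale every row to unit length so Euclidean order matches cosine order
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    tree = KDTree(unit)
    # ask for k + 1 neighbours: the closest one is the query point itself
    dist, idx = tree.query(unit, k=k + 1)

    # for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
    cos = 1.0 - 0.5 * dist ** 2

    # drop the first column (the point itself)
    return idx[:, 1:], cos[:, 1:]
```

As the answer notes, this is mainly useful in low dimensions; with 400-dimensional vectors a KD-tree query tends to approach brute-force cost, which is why a dimensionality reduction step such as random projection is suggested first. After a projection the neighbours are approximate rather than exact.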

answered Oct 22 '22 02:10

ericf

Tags: python, vector, scipy, scikit-learn, sklearn-pandas
