Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to quickly calculate cosine similarity for large number of vectors in Python?

I have a set of 100 thousand vectors and I need to retrieve top-25 closest vector based on cosine similarity.

Scipy and Sklearn have implementations for computing cosine distance/similarity 2 vectors but I will need to compute the Cosine Sim for 100k X 100k size and then take out top-25. Is there any fast implemenet in python compute that?

As per @Silmathoron Suggestion, this is what I am doing -

#vectors is a list of vectors of size : 100K x 400 i.e. 100K vectors each of dimenions 400
vectors = numpy.array(vectors)  
similarity = numpy.dot(vectors, vectors.T)


# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

k = 26

box_plot_file = file("box_data.csv","w+")

for sim,query in itertools.izip(cosine,queries):
    k_largest = heapq.nlargest(k, sim)
    k_largest = map(str,k_largest)
    result = query + "," + ",".join(k_largest) + "\n"
    box_plot_file.write(result)
box_plot_file.close()
like image 429
silent_dev Avatar asked Jun 25 '16 14:06

silent_dev


People also ask

How do you find the cosine similarity between two vectors in Python?

We use the below formula to compute the cosine similarity. where A and B are vectors: A.B is dot product of A and B: It is computed as sum of element-wise product of A and B. ||A|| is L2 norm of A: It is computed as square root of the sum of squares of elements of the vector A.

How do you code cosine similarity in Python?

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Similarity = (A.B) / (||A||. ||B||) where A and B are vectors.

What is the maximum cosine similarity?

The cosine similarity is a number between 0 and 1 and is commonly used in plagiarism detection. A document is converted to a vector in where n is the number of unique words in the documents in question.

Is cosine similarity faster than Euclidean distance?

However, in such circumstances, cosine similarity is bijective with Euclidean distance, so there's no real advantage to one over the other theoretically; in practice, cosine similarity is faster then.


1 Answers

I would try smarter algorithms first, rather than speeding up brute force (computing all pairs of vectors). KDTrees might work, scipy.spatial.KDTree(), if your vectors are of low dimension. If they are high dimension then you might need a random projection first: http://scikit-learn.org/stable/modules/random_projection.html

like image 51
ericf Avatar answered Oct 22 '22 02:10

ericf