I have a set of 100 thousand vectors and I need to retrieve the top-25 closest vectors based on cosine similarity.
Scipy and Sklearn have implementations for computing the cosine distance/similarity between 2 vectors, but I would need to compute the cosine similarity for a 100K x 100K matrix and then take out the top-25 for each row. Is there any fast implementation in Python to compute that?
As per @Silmathoron's suggestion, this is what I am doing -
import heapq
import numpy

# vectors is a list of size 100K x 400, i.e. 100K vectors, each of dimension 400
vectors = numpy.array(vectors)
# all pairwise dot products
similarity = numpy.dot(vectors, vectors.T)
# squared magnitude of each vector (the diagonal of the dot-product matrix)
square_mag = numpy.diag(similarity)
# inverse squared magnitude
inv_square_mag = 1 / square_mag
# for all-zero vectors, set the inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0
# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)
# cosine similarity (scale rows, then columns, by the inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

# k = 26 because each row's largest similarity is the vector with itself (1.0)
k = 26
# queries holds one identifier string per vector (defined elsewhere)
box_plot_file = open("box_data.csv", "w+")
for sim, query in zip(cosine, queries):
    k_largest = heapq.nlargest(k, sim)
    k_largest = map(str, k_largest)
    result = query + "," + ",".join(k_largest) + "\n"
    box_plot_file.write(result)
box_plot_file.close()
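Note that a full 100K x 100K float64 similarity matrix needs roughly 80 GB of RAM (100,000 x 100,000 x 8 bytes), so on most machines the matrix has to be built in chunks. Here is a minimal sketch of a chunked variant; the function name, chunk size, and use of numpy.argpartition are my own choices for illustration, not from the original post:

import numpy

def top_k_chunked(vectors, k=25, chunk=1024):
    # normalize rows once (assumes no all-zero vectors; guard them as above if needed)
    unit = vectors / numpy.linalg.norm(vectors, axis=1, keepdims=True)
    n = unit.shape[0]
    top_idx = numpy.empty((n, k), dtype=numpy.int64)
    top_sim = numpy.empty((n, k))
    for start in range(0, n, chunk):
        block = unit[start:start + chunk] @ unit.T              # (chunk, n) cosine similarities
        numpy.fill_diagonal(block[:, start:start + chunk], -numpy.inf)  # mask self-similarity
        part = numpy.argpartition(block, -k, axis=1)[:, -k:]            # unordered top-k per row
        rows = numpy.arange(block.shape[0])[:, None]
        order = numpy.argsort(block[rows, part], axis=1)[:, ::-1]       # sort those k descending
        top_idx[start:start + chunk] = part[rows, order]
        top_sim[start:start + chunk] = block[rows, part][rows, order]
    return top_idx, top_sim

With this, top_idx[i] holds the indices of the 25 vectors most similar to vectors[i], and top_sim[i] the corresponding similarities.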
We use the below formula to compute the cosine similarity, where A and B are vectors:

Similarity = (A.B) / (||A||.||B||)

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. A.B is the dot product of A and B: it is computed as the sum of the element-wise products of A and B. ||A|| is the L2 norm of A: it is computed as the square root of the sum of the squares of the elements of A.
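A quick sanity check of the formula (the two vectors here are made up for illustration):

import numpy

a = numpy.array([1.0, 2.0, 3.0])
b = numpy.array([4.0, 5.0, 6.0])
cos = numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
# 32 / (sqrt(14) * sqrt(77)) ~= 0.9746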
For vectors with non-negative components, such as word counts, the cosine similarity is a number between 0 and 1; it is commonly used in plagiarism detection, where each document is converted to a vector in R^n, n being the number of unique words across the documents in question.
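As a toy illustration with scikit-learn (the documents and variable names below are made up, not from the original post):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the cat sat on the hat"]
counts = CountVectorizer().fit_transform(docs)   # 2 x n matrix of word counts
print(cosine_similarity(counts))                 # entries lie in [0, 1]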
However, once the vectors are normalized to unit length, cosine similarity is a bijective (monotonic) function of Euclidean distance, since ||A - B||^2 = 2 - 2*(A.B) for unit vectors, so there's no real advantage to one over the other theoretically; in practice, cosine similarity is faster to compute.
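A quick numeric check of that identity (the dimension 400 matches the question; the random vectors are just for illustration):

import numpy

rng = numpy.random.default_rng(0)
a, b = rng.standard_normal(400), rng.standard_normal(400)
a, b = a / numpy.linalg.norm(a), b / numpy.linalg.norm(b)   # unit length
lhs = numpy.linalg.norm(a - b) ** 2
rhs = 2 - 2 * numpy.dot(a, b)
print(numpy.isclose(lhs, rhs))   # True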
I would try smarter algorithms first, rather than speeding up brute force (computing all pairs of vectors). KD-trees might work (scipy.spatial.KDTree) if your vectors are of low dimension. If they are high-dimensional, then you might need a random projection first: http://scikit-learn.org/stable/modules/random_projection.html
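A rough sketch of that approach, assuming vectors is the 100K x 400 array from the question and that no vector is all zeros; the projection size (n_components=32) and the choice of cKDTree are my own, and the results are approximate because random projection only approximately preserves distances:

import numpy
from scipy.spatial import cKDTree
from sklearn.random_projection import GaussianRandomProjection

# project the 400-dimensional vectors down to a lower dimension
projected = GaussianRandomProjection(n_components=32, random_state=0).fit_transform(vectors)
# normalize to unit length so Euclidean ranking matches cosine ranking
projected /= numpy.linalg.norm(projected, axis=1, keepdims=True)

tree = cKDTree(projected)
# k=26 because the nearest neighbor of each point is the point itself
dist, idx = tree.query(projected, k=26)
neighbors = idx[:, 1:]   # approximate top-25 neighbors by cosine similarity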