I have n vectors, each with m elements (real number). I want to find the pair where there cosine similarity is maximum among all pairs.
The straightforward solution would require O(n2m) time.
Is there any better solution?
update
Cosine similarity / distance and triangle equation Inspires me that I could replace "cosine similarity" with "chord length" which loses precision but increases speed a lot. ( there are many existing solutions solving Nearest Neighbor in metric space, like ANN )
The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together. The smaller the angle, higher the cosine similarity.
2.4. Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
The higher similarity, the lower distances. When you pick the threshold for similarities for text/documents, usually a value higher than 0.5 shows strong similarities.
Cosine similarity sim(a,b)
is related to Euclidean distance |a - b|
by
|a - b|² = 2(1 - sim(a,b))
for unit vectors a
and b
.
That means cosine similarity is greatest when Euclidean distance is smallest after normalizing by the L2 norm, and the problem reduces to the closest pair of points problem, which can be solved in O(n lg n) time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With