 

Cosine Similarity of Vectors, with < O(n^2) complexity

Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html

However, these seem to run in O(n^2). I've been doing some document clustering and found that this level of complexity isn't feasible even for small document sets. Since the dot product only needs the terms that appear in both vectors, it should be possible to store the vectors in a tree and compute the dot product in O(n log n), where n is the smaller number of unique terms across the two documents.

Am I missing something? Is there a java library which does this?

Thanks.

Ash, asked Jul 27 '10


People also ask

What is the time complexity of cosine similarity?

Assume computing the distance (cosine similarity) between two instances costs O(m), where m is the dimensionality of the vectors; in practice this corresponds to the average number of terms common to two documents. In k-means clustering, reassigning points to clusters then requires O(kn) distance computations, for O(knm) in total.

Is cosine similarity always between 0 and 1?

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights. The angle between two term frequency vectors cannot be greater than 90°.

Is cosine similarity just dot product?

Not quite: the dot product is proportional to both the cosine of the angle and the lengths of the vectors, so a pair of long vectors can score higher than a pair of shorter vectors with a smaller angle between them. The dot product equals cosine similarity only when the vectors are normalized to unit length.


3 Answers

If you store the vector elements in a hash table, lookup is expected O(1) anyway (or O(log n) with a tree map), no? Loop over all keys in the smaller document and check whether each one exists in the larger one.
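A minimal sketch of that idea in Java, assuming each document has already been reduced to a term-to-weight map (the class and method names here are made up for illustration):

import java.util.Map;

public class SparseCosine {
    // Dot product of two sparse vectors stored as term -> weight maps.
    // Iterates only over the smaller map, so the cost is O(min(n, m))
    // expected-time hash lookups rather than O(n * m).
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        if (a.size() > b.size()) { Map<String, Double> t = a; a = b; b = t; }
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }

    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        return dot(a, b) / (norm(a) * norm(b));
    }
}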

Nicolas78, answered Nov 13 '22


Hashmap is good, but it might take a lot of memory.

If your vectors are stored as key-value pairs sorted by key, then the multiplication can be done in O(n + m), where n and m are the vectors' lengths: just iterate over both vectors in parallel, the same way the merge step of merge sort does. The pseudocode for multiplication:

i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
  if vec1[i].key == vec2[j].key:
    result = result + vec1[i].value * vec2[j].value
    i = i + 1
    j = j + 1
  else if vec1[i].key < vec2[j].key:
    i = i + 1
  else:
    j = j + 1
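
For reference, here is a runnable Java version of the same loop, assuming each sparse vector is stored as a pair of parallel arrays (term ids sorted ascending, plus their weights); the class and method names are just illustrative:

public class SortedDot {
    // Merge-style walk over two sorted sparse vectors, O(n + m).
    // keys must be sorted ascending; vals[i] is the weight of keys[i].
    static double dot(int[] keys1, double[] vals1, int[] keys2, double[] vals2) {
        double result = 0.0;
        int i = 0, j = 0;
        while (i < keys1.length && j < keys2.length) {
            if (keys1[i] == keys2[j]) {
                result += vals1[i] * vals2[j];
                i++;               // advance both sides past the matched key,
                j++;               // otherwise the loop would never terminate
            } else if (keys1[i] < keys2[j]) {
                i++;               // term appears only in vec1, contributes 0
            } else {
                j++;               // term appears only in vec2, contributes 0
            }
        }
        return result;
    }
}

Two parallel primitive arrays also address the memory concern above, since they avoid the per-entry object overhead of a hashmap.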
dmitry_vk, answered Nov 13 '22


If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
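
A rough sketch of the classic random-hyperplane scheme for cosine LSH (signed random projections; all names and parameters below are illustrative, not taken from any specific library):

import java.util.Random;

public class CosineLsh {
    private final double[][] planes;  // one random hyperplane per signature bit

    // numBits should be <= 31 so the signature fits in an int.
    CosineLsh(int numBits, int dim, long seed) {
        Random rnd = new Random(seed);
        planes = new double[numBits][dim];
        for (double[] p : planes)
            for (int d = 0; d < dim; d++)
                p[d] = rnd.nextGaussian();  // Gaussian entries give uniformly random directions
    }

    // Bit b records which side of hyperplane b the vector lies on. Vectors
    // separated by a small angle agree on most bits, so they tend to land in
    // the same bucket; exact cosine is then computed only within buckets.
    int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) dot += planes[b][d] * v[d];
            if (dot >= 0) sig |= 1 << b;
        }
        return sig;
    }
}

Grouping vectors by signature(v), possibly across several independently seeded tables to boost recall, yields the buckets described above.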

Hope this helps!

templatetypedef, answered Nov 13 '22