Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
However, it seems these run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible even for small document sets. Since the dot product only needs the terms that appear in both vectors, it should be possible to store the vectors in a tree and compute the dot product in O(n log n), where n is the smaller number of unique terms in either of the two documents.
Am I missing something? Is there a java library which does this?
Thanks!
Time Complexity: assume computing the distance (cosine similarity) between two instances is O(m), where m is the dimensionality of the vectors, i.e. the average number of terms common to two documents. Reassigning clusters then requires O(kn) distance computations, or O(knm) in total.
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights. The angle between two term frequency vectors cannot be greater than 90°.
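As a quick check of that range, here is a minimal cosine similarity in Java (the class and method names are illustrative, not from any library): with non-negative term frequencies the dot product cannot be negative, so the result stays between 0 and 1.

```java
class Cosine {
    // Cosine similarity of two dense term-frequency vectors.
    // With non-negative components the dot product is non-negative,
    // so the returned value lies in [0, 1] (angle at most 90 degrees).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```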
Correct! The dot product is proportional to both the cosine and the lengths of the vectors. So even though the cosine is higher for "b" and "c", the greater length of "a" makes "a" and "b" more similar than "b" and "c". You are calculating similarity for music videos.
If you store the vector elements in a hash table, lookup is only O(1) on average anyway (a sorted tree map would give O(log n)). Loop over all keys in the smaller document and check whether they exist in the larger one..?
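A sketch of that hash-map approach in Java (class and method names here are mine, not from any library): iterate over the smaller map and probe the larger one, which gives O(min(|a|, |b|)) expected time for the dot product.

```java
import java.util.Map;

class SparseDot {
    // Dot product of two sparse vectors stored as term -> weight maps.
    // Iterating over the smaller map and probing the larger one makes this
    // O(min(|a|, |b|)) expected time, since each hash lookup is O(1) on average.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());     // null if the term is absent
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }
}
```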
A HashMap is good, but it might use a lot of memory.
If your vectors are stored as key-value pairs sorted by key, then the dot product can be computed in O(n): just iterate in parallel over both vectors (the same iteration is used, for example, in the merge step of merge sort). Pseudocode for the multiplication:
i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
    if vec1[i].key == vec2[j].key:
        result = result + vec1[i].value * vec2[j].value
        i = i + 1
        j = j + 1
    else if vec1[i].key < vec2[j].key:
        i = i + 1
    else:
        j = j + 1
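A runnable Java version of the pseudocode above, assuming the vectors are lists of (term id, weight) pairs sorted by term id (the representation is illustrative):

```java
import java.util.List;
import java.util.Map;

class SortedSparseDot {
    // Dot product of two sparse vectors stored as (key, value) pairs sorted
    // by key. One parallel pass over both lists, as in the merge step of
    // merge sort: O(|vec1| + |vec2|) comparisons, no hashing required.
    static double dot(List<Map.Entry<Integer, Double>> vec1,
                      List<Map.Entry<Integer, Double>> vec2) {
        int i = 0, j = 0;
        double result = 0.0;
        while (i < vec1.size() && j < vec2.size()) {
            int cmp = vec1.get(i).getKey().compareTo(vec2.get(j).getKey());
            if (cmp == 0) {                 // term present in both vectors
                result += vec1.get(i).getValue() * vec2.get(j).getValue();
                i++;
                j++;
            } else if (cmp < 0) {           // term only in vec1: skip it
                i++;
            } else {                        // term only in vec2: skip it
                j++;
            }
        }
        return result;
    }
}
```

Note that both indices must advance when the keys match, otherwise the loop never terminates.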
If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
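For a concrete picture, here is a minimal sketch of the classic random-hyperplane LSH scheme for cosine similarity (all names and parameters are illustrative, not from a specific library): each random hyperplane contributes one sign bit to a signature, so vectors with a small angle between them agree on most bits and tend to land in the same bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class CosineLsh {
    // Random-hyperplane LSH sketch. Each of `bits` random Gaussian hyperplanes
    // contributes one bit: the sign of the dot product with the hyperplane's
    // normal. Similar vectors (small angle) agree on most bits.
    final double[][] planes;
    final Map<Integer, List<Integer>> buckets = new HashMap<>();

    CosineLsh(int dim, int bits, long seed) {
        Random rnd = new Random(seed);
        planes = new double[bits][dim];
        for (double[] p : planes)
            for (int d = 0; d < p.length; d++)
                p[d] = rnd.nextGaussian();
    }

    // Pack the sign bits into an int signature (requires bits <= 31).
    int signature(double[] x) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int d = 0; d < x.length; d++) dot += planes[b][d] * x[d];
            if (dot >= 0) sig |= 1 << b;
        }
        return sig;
    }

    // Drop a document into the bucket keyed by its signature; candidate pairs
    // for clustering are then only the documents sharing a bucket.
    void add(int docId, double[] x) {
        buckets.computeIfAbsent(signature(x), k -> new ArrayList<>()).add(docId);
    }
}
```

Since only the signs of the dot products matter, scaling a vector does not change its signature, which matches cosine similarity's indifference to vector length.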
Hope this helps!