Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
However, it seems these run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible even for small document sets. Since the dot product only needs the terms that appear in both vectors, it should be possible to store the vectors in a tree and compute the dot product in O(n log n), where n is the smaller number of unique terms in either of the two documents.
Am I missing something? Is there a java library which does this?
Thanks!
Time Complexity: assume computing the distance (cosine similarity) between two instances is O(m), where m is the dimensionality of the vectors, i.e. the average number of terms common to two documents. Reassigning clusters then requires O(kn) distance computations, or O(knm) in total.
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights. The angle between two term frequency vectors cannot be greater than 90°.
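As a quick check of that range, here is a minimal cosine similarity in Java (the class and method names are illustrative, not from any library): with non-negative term frequencies the dot product cannot be negative, so the result stays between 0 and 1.

```java
class Cosine {
    // Cosine similarity of two dense term-frequency vectors.
    // With non-negative components the dot product is non-negative,
    // so the returned value lies in [0, 1] (angle at most 90 degrees).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```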
Correct! The dot product is proportional to both the cosine and the lengths of the vectors. So even though the cosine is higher for "b" and "c", the greater length of "a" makes "a" and "b" more similar than "b" and "c". You are calculating similarity for music videos.
If you store the vector elements in a hash table, lookup is only O(1) on average anyway (a sorted tree map would give O(log n)). Loop over all keys in the smaller document and check whether they exist in the larger one..?
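A sketch of that hash-map approach in Java (class and method names here are mine, not from any library): iterate over the smaller map and probe the larger one, which gives O(min(|a|, |b|)) expected time for the dot product.

```java
import java.util.Map;

class SparseDot {
    // Dot product of two sparse vectors stored as term -> weight maps.
    // Iterating over the smaller map and probing the larger one makes this
    // O(min(|a|, |b|)) expected time, since each hash lookup is O(1) on average.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());     // null if the term is absent
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }
}
```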
A HashMap is good, but it might use a lot of memory.
If your vectors are stored as key-value pairs sorted by key, then the dot product can be computed in O(n): just iterate in parallel over both vectors (the same iteration is used, for example, in the merge step of merge sort). Pseudocode for the multiplication:
i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
    if vec1[i].key == vec2[j].key:
        result = result + vec1[i].value * vec2[j].value
        i = i + 1
        j = j + 1
    else if vec1[i].key < vec2[j].key:
        i = i + 1
    else:
        j = j + 1
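A runnable Java version of the pseudocode above, assuming the vectors are lists of (term id, weight) pairs sorted by term id (the representation is illustrative):

```java
import java.util.List;
import java.util.Map;

class SortedSparseDot {
    // Dot product of two sparse vectors stored as (key, value) pairs sorted
    // by key. One parallel pass over both lists, as in the merge step of
    // merge sort: O(|vec1| + |vec2|) comparisons, no hashing required.
    static double dot(List<Map.Entry<Integer, Double>> vec1,
                      List<Map.Entry<Integer, Double>> vec2) {
        int i = 0, j = 0;
        double result = 0.0;
        while (i < vec1.size() && j < vec2.size()) {
            int cmp = vec1.get(i).getKey().compareTo(vec2.get(j).getKey());
            if (cmp == 0) {                 // term present in both vectors
                result += vec1.get(i).getValue() * vec2.get(j).getValue();
                i++;
                j++;
            } else if (cmp < 0) {           // term only in vec1: skip it
                i++;
            } else {                        // term only in vec2: skip it
                j++;
            }
        }
        return result;
    }
}
```

Note that both indices must advance when the keys match, otherwise the loop never terminates.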
If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
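For a concrete picture, here is a minimal sketch of the classic random-hyperplane LSH scheme for cosine similarity (all names and parameters are illustrative, not from a specific library): each random hyperplane contributes one sign bit to a signature, so vectors with a small angle between them agree on most bits and tend to land in the same bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class CosineLsh {
    // Random-hyperplane LSH sketch. Each of `bits` random Gaussian hyperplanes
    // contributes one bit: the sign of the dot product with the hyperplane's
    // normal. Similar vectors (small angle) agree on most bits.
    final double[][] planes;
    final Map<Integer, List<Integer>> buckets = new HashMap<>();

    CosineLsh(int dim, int bits, long seed) {
        Random rnd = new Random(seed);
        planes = new double[bits][dim];
        for (double[] p : planes)
            for (int d = 0; d < p.length; d++)
                p[d] = rnd.nextGaussian();
    }

    // Pack the sign bits into an int signature (requires bits <= 31).
    int signature(double[] x) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int d = 0; d < x.length; d++) dot += planes[b][d] * x[d];
            if (dot >= 0) sig |= 1 << b;
        }
        return sig;
    }

    // Drop a document into the bucket keyed by its signature; candidate pairs
    // for clustering are then only the documents sharing a bucket.
    void add(int docId, double[] x) {
        buckets.computeIfAbsent(signature(x), k -> new ArrayList<>()).add(docId);
    }
}
```

Since only the signs of the dot products matter, scaling a vector does not change its signature, which matches cosine similarity's indifference to vector length.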
Hope this helps!