
How do I calculate the cosine similarity of two vectors?

People also ask

What is cosine similarity formula?

We define cosine similarity mathematically as the dot product of the vectors divided by the product of their magnitudes. For example, if we have two vectors A and B, the similarity between them is calculated as: similarity(A, B) = cos(θ) = A · B / (‖A‖ ‖B‖).

What is the cosine similarity between a vector and itself?

The cosine similarity is the cosine of the angle between two vectors. The vectors are typically non-zero and lie within an inner product space. Mathematically, it is the dot product of the vectors divided by the product of their Euclidean norms (magnitudes). For a non-zero vector compared with itself, the angle is 0, so the cosine similarity is 1.

How do you find the cosine similarity between two lists?

To compute the cosine similarity of two lists, treat them as vectors A and B and apply the same formula, A · B / (||A|| ||B||), where A · B is the dot product of A and B (the sum of their element-wise products) and ||A|| is the L2 norm of A (the square root of the sum of the squares of the elements of A).
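As a concrete sketch of that formula applied to two Java lists (the cosineOfLists helper and the example values are my own, not from the answers below):

import java.util.List;

// Cosine similarity of two equal-length lists: dot(A, B) / (||A|| * ||B||).
static double cosineOfLists(List<Double> a, List<Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.size(); i++) {
        dot += a.get(i) * b.get(i);    // sum of element-wise products
        normA += a.get(i) * a.get(i);  // sum of squares of A's elements
        normB += b.get(i) * b.get(i);  // sum of squares of B's elements
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// cosineOfLists(List.of(1.0, 2.0, 3.0), List.of(4.0, 5.0, 6.0)) ≈ 0.97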


If you want to avoid relying on third-party libraries for such a simple task, here is a plain Java implementation:

public static double cosineSimilarity(double[] vectorA, double[] vectorB) {
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vectorA.length; i++) {
        dotProduct += vectorA[i] * vectorB[i]; // accumulate A . B
        normA += Math.pow(vectorA[i], 2);      // accumulate ||A||^2
        normB += Math.pow(vectorB[i], 2);      // accumulate ||B||^2
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Note that the function assumes that the two vectors have the same length. You may want to check this explicitly for safety.
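For example, a minimal usage sketch (the guard clause is just one possible way to add that length check):

double[] a = {3.0, 4.0};
double[] b = {4.0, 3.0};

if (a.length != b.length) {
    throw new IllegalArgumentException("Vectors must be of equal length");
}
System.out.println(cosineSimilarity(a, b)); // 24 / (5 * 5) = 0.96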


Have a look at: http://en.wikipedia.org/wiki/Cosine_similarity.

If you have two vectors A and B, the similarity is defined as:

cosine(theta) = A . B / (||A|| ||B||)

For a vector A = (a1, a2), ||A|| is defined as sqrt(a1^2 + a2^2)

For vectors A = (a1, a2) and B = (b1, b2), A . B is defined as a1 b1 + a2 b2.

So for vectors A = (a1, a2) and B = (b1, b2), the cosine similarity is given as:

  (a1 b1 + a2 b2) / (sqrt(a1^2 + a2^2) sqrt(b1^2 + b2^2))

Example:

A = (1, 0.5), B = (0.5, 1)

cosine(theta) = (0.5 + 0.5) / (sqrt(5/4) * sqrt(5/4)) = 1 / (5/4) = 4/5
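A quick check of that arithmetic with the plain-Java cosineSimilarity method shown earlier:

double[] a = {1.0, 0.5};
double[] b = {0.5, 1.0};
System.out.println(cosineSimilarity(a, b)); // ≈ 0.8, i.e. 4/5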

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // Dot product: norm1() sums the entries of the element-wise product
    // (the weights are non-negative tf-idf values, so absolute values equal the values).
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // Product of the two document vectors' Euclidean (Frobenius) norms.
    double euclideanDist = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / euclideanDist;
  }
}

I did some tf-idf work recently for my Information Retrieval unit at university. I used this cosine similarity method, which uses Jama: Java Matrix Package.

For the full source code see IR Math with Java : Similarity Measures, a really good resource that covers a good few different similarity measures.
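For reference, here is a minimal standalone sketch of the same computation directly against the Jama API, assuming the documents are single-column matrices of non-negative tf-idf weights (the example values are my own); with non-negative entries, norm1() of the element-wise product is exactly the dot product:

import Jama.Matrix;

Matrix sourceDoc = new Matrix(new double[][]{{1.0}, {0.5}});  // column vector for document A
Matrix targetDoc = new Matrix(new double[][]{{0.5}, {1.0}});  // column vector for document B

double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();  // sum of element-wise products
double euclideanDist = sourceDoc.normF() * targetDoc.normF(); // product of Euclidean (Frobenius) norms
System.out.println(dotProduct / euclideanDist);               // ≈ 0.8, matching the worked example above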


For matrix code in Java I'd recommend using the Colt library. If you have it, the code looks like this (not tested or even compiled):

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.impl.DenseDoubleMatrix1D;

DoubleMatrix1D a = new DenseDoubleMatrix1D(new double[]{1, 0.5});
DoubleMatrix1D b = new DenseDoubleMatrix1D(new double[]{0.5, 1});
double cosineSimilarity = a.zDotProduct(b) / Math.sqrt(a.zDotProduct(a) * b.zDotProduct(b));

The code above could also be altered to use one of the Blas.dnrm2() methods or Algebra.DEFAULT.norm2() for the norm calculation. The result is exactly the same; which version is more readable is a matter of taste.


When I was working on text mining some time ago, I was using the SimMetrics library, which provides an extensive range of different metrics in Java. If you need more than that, there is always R and CRAN to look at.

But coding it from the description on Wikipedia is a rather trivial task, and can be a nice exercise.