Vector Space Model: Cosine Similarity vs Euclidean Distance

I have corpora of classified text. From these I create vectors, one per document. The components of each vector are the weights of the words in that document, computed as TF-IDF values. Next I build a model in which every class is represented by a single vector, so the model has as many vectors as there are classes in the corpora. Each component of a model vector is computed as the mean of the corresponding component over all document vectors in that class. For an unclassified vector, I determine its similarity to a model vector by computing the cosine of the angle between the two vectors.
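The setup described above can be sketched roughly as follows. This is only an illustration, assuming plain term counts multiplied by log(N / df) as the TF-IDF weighting; the function names (`fit_idf`, `vectorize`, `class_centroids`, `cosine`, `classify`) and the toy corpus are hypothetical and not taken from any particular library.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Compute idf = log(N / df) for every term in the training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def vectorize(doc, idf):
    """Sparse TF-IDF vector as {term: weight}; terms unseen in training are ignored."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}

def class_centroids(vectors, labels):
    """One model vector per class: the component-wise mean of its document vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return {label: {t: w / counts[label] for t, w in s.items()}
            for label, s in sums.items()}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(vec, model):
    """Return the class whose model vector has the highest cosine with vec."""
    return max(model, key=lambda label: cosine(vec, model[label]))

# Toy corpus (made up for illustration): each document is a list of tokens.
train_docs = [["goal", "match", "team"], ["team", "score", "goal"],
              ["stock", "market", "price"], ["market", "trade", "stock"]]
train_labels = ["sport", "sport", "finance", "finance"]

idf = fit_idf(train_docs)
model = class_centroids([vectorize(d, idf) for d in train_docs], train_labels)
predicted = classify(vectorize(["goal", "goal", "team", "price"], idf), model)
```

Sparse dictionaries are used here only to keep the sketch dependency-free; a real corpus would more likely use a sparse matrix library.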

Questions:

1) Can I use the Euclidean distance between an unclassified vector and a model vector to compute their similarity?

2) Why can't Euclidean distance be used as a similarity measure instead of the cosine of the angle between two vectors, and vice versa?

Thanks!

Anton Ashanin asked Oct 16 '13 17:10


1 Answer

One informal but rather intuitive way to think about this is to consider the two components of a vector: direction and magnitude.

Direction is the "preference" / "style" / "sentiment" / "latent variable" of the vector, while magnitude is how strongly the vector points in that direction.

When classifying documents, we'd like to categorize them by their overall sentiment, so we use the angular distance.

Euclidean distance, by contrast, is sensitive to magnitude: documents can end up clustered by their L2 norm (vector length) instead of by direction. That is, vectors with quite different directions may be clustered together simply because their distances from the origin are similar.
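A small numeric sketch of the direction-versus-magnitude point, using made-up 2-dimensional vectors (the names `short_doc`, `long_doc` and `other_doc` are purely illustrative). It also checks a standard identity that is not part of the answer above: for vectors scaled to unit length, squared Euclidean distance equals 2 - 2 * cosine, so after L2 normalization the two measures rank neighbours identically.

```python
import math

def norm(v):
    """L2 norm (length) of a dense vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(v):
    """Scale v to unit length, keeping its direction."""
    n = norm(v)
    return [x / n for x in v]

short_doc = [1.0, 2.0]    # some direction, small magnitude
long_doc = [10.0, 20.0]   # same direction as short_doc, 10x the magnitude
other_doc = [2.0, 1.0]    # different direction, same magnitude as short_doc

# Cosine ignores magnitude: long_doc is the closer match to short_doc.
cos_long = cosine_similarity(short_doc, long_doc)
cos_other = cosine_similarity(short_doc, other_doc)

# Euclidean distance is dominated by magnitude: other_doc is the closer match.
dist_long = euclidean_distance(short_doc, long_doc)
dist_other = euclidean_distance(short_doc, other_doc)

# After L2 normalization the two measures agree: d^2 = 2 - 2 * cos.
dist_unit_sq = euclidean_distance(unit(short_doc), unit(other_doc)) ** 2
```

So if the document and model vectors are normalized to unit length first, Euclidean distance can stand in for cosine; on raw TF-IDF vectors the two can disagree, as the example shows.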

kizzx2 answered Sep 16 '22 13:09