 

Why are Cosine Similarity and TF-IDF used together?

TF-IDF and Cosine Similarity are a commonly used combination for text clustering. Each document is represented by a vector of TF-IDF weights.

This is what my textbook says.

With Cosine Similarity you can then compute the similarities between those documents.

But why are exactly those techniques used together?
What is the advantage?

Could for example Jaccard Similarity also be used?

I know how it works, but I want to know why exactly these techniques are used.

Evgenij Reznik asked Sep 25 '22 10:09



People also ask

Does cosine similarity use TF-IDF?

TF-IDF will give you a representation for a given term in a document. Cosine similarity will give you a score for two different documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term".

Could the cosine similarity be negative when using TF-IDF vector representations?

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights.
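The bound follows directly from the definition. As a short sketch, assume document vectors $a$ and $b$ with non-negative components, which holds for raw term frequencies and for TF-IDF with the common $\log(N/\mathrm{df})$ form of IDF (some IDF variants can go negative, in which case the bound no longer holds):

```latex
\cos(a, b) = \frac{\sum_i a_i b_i}{\lVert a \rVert \, \lVert b \rVert},
\qquad a_i \ge 0,\; b_i \ge 0
\;\Longrightarrow\; \sum_i a_i b_i \ge 0
\;\Longrightarrow\; 0 \le \cos(a, b) \le 1
```

The upper bound of 1 is the Cauchy-Schwarz inequality; the lower bound of 0 comes from every term in the numerator being non-negative.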

Why use cosine similarity instead of Euclidean distance?

Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (for example, because one is much longer than the other), their vectors may still point in nearly the same direction. The smaller the angle between them, the higher the cosine similarity.
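A minimal sketch of this length effect, using two made-up term-count vectors in which the second is simply the first scaled by ten (the vectors and function names are illustrative, not taken from any library):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term counts: same word proportions, different document length.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]

cos = cosine_similarity(short_doc, long_doc)
dist = euclidean_distance(short_doc, long_doc)
print(f"cosine similarity:  {cos:.3f}")
print(f"euclidean distance: {dist:.3f}")
```

The two vectors point in the same direction, so the cosine similarity is at its maximum even though the Euclidean distance between them is large.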

Why might we use TF-IDF?

TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is to the document. Documents with similar relevant words will then have similar vectors, which is what we are looking for in a machine learning algorithm.


1 Answer

TF-IDF is the weighting used.

Cosine is the measure used.

You could use cosine without weighting, but the results are then usually worse. Jaccard works on sets, so it is not obvious how to incorporate weights without turning it into a different measure, or into something equivalent to cosine.
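To make the distinction concrete, here is a small pure-Python sketch that compares cosine on raw counts, cosine on TF-IDF weights, and Jaccard on word sets. It assumes whitespace tokenization and the plain `tf * log(N / df)` weighting, which is only one of several TF-IDF variants; the toy documents and all function names are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough here.
    return text.lower().split()

def count_vectors(docs):
    """Sparse raw term-count vectors, one dict per document."""
    return [dict(Counter(tokenize(d))) for d in docs]

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df[term]) for term in df}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def jaccard(text_a, text_b):
    """Jaccard similarity on word sets; counts and weights are ignored."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the mouse",
]

raw_cos = cosine(*count_vectors(docs)[:2])
tfidf_cos = cosine(*tfidf_vectors(docs)[:2])
jac = jaccard(docs[0], docs[1])
print(f"cosine on raw counts: {raw_cos:.3f}")
print(f"cosine on TF-IDF:     {tfidf_cos:.3f}")
print(f"Jaccard on word sets: {jac:.3f}")
```

In this toy corpus "the" appears in every document, so its IDF is zero and it no longer inflates the similarity between the first two documents. Raw-count cosine counts it fully, and Jaccard treats it as just one shared word among several, with no way to express that it is less informative than "cat" or "mat".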

Has QUIT--Anony-Mousse answered Nov 15 '22 07:11
