TF-IDF and Cosine Similarity is a commonly used combination for text clustering: each document is represented by a vector of TF-IDF weights. This is what my textbook says. With Cosine Similarity you can then compute the similarity between those documents.

But why exactly are those techniques used together? What is the advantage? Could, for example, Jaccard Similarity also be used? I know how these techniques work, but I want to know why exactly this pair is used.
TF-IDF will give you a representation for a given term in a document. Cosine similarity will give you a score for two different documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term".
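A minimal sketch of that ranking idea (the `tfidf` structure mapping document ids to per-term weights is an illustrative assumption, not any library's API):

```python
# Hypothetical sketch: rank documents by summing the tf-idf weights of the query terms.
# `tfidf` maps each document id to a dict of {term: tf-idf weight}.
def rank(query_terms, tfidf):
    scores = {}
    for doc_id, weights in tfidf.items():
        # A document's score is the sum of its tf-idf weights for the query terms;
        # terms absent from the document contribute 0.
        scores[doc_id] = sum(weights.get(term, 0.0) for term in query_terms)
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```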
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights.
The cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (for example, because one is much longer than the other), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.
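A small sketch of that point (using NumPy; the toy term-count vectors are made up for illustration). A document and a three-times-longer copy of it are far apart in Euclidean terms but have a cosine similarity of 1, since they point in the same direction:

```python
import numpy as np

# Toy term-count vectors: `long_doc` is `doc` concatenated three times, so counts triple.
doc = np.array([2.0, 1.0, 0.0, 3.0])
long_doc = 3 * doc

# Cosine similarity: cos(theta) = (a . b) / (||a|| * ||b||)
cos = doc @ long_doc / (np.linalg.norm(doc) * np.linalg.norm(long_doc))
dist = np.linalg.norm(doc - long_doc)  # Euclidean distance

print(cos)   # ~1.0 -- same orientation, maximal similarity
print(dist)  # ~7.48 -- far apart by Euclidean distance
```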
TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Then documents with similar, relevant words will have similar vectors, which is what we are looking for in a machine learning algorithm.
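For instance, a minimal sketch with scikit-learn (assuming it is installed; the sample sentences are made up), showing that the two documents sharing vocabulary get a noticeably higher similarity than the unrelated one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents; the first two share topical vocabulary.
docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "stock markets fell sharply today",
]

# Each row of X is one document's vector of tf-idf weights.
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities; entries lie in [0, 1] since the weights are non-negative.
print(cosine_similarity(X))
```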
TF-IDF is the weighting used.
Cosine is the measure used.
You could use cosine without weighting, but the results are then usually worse. Jaccard works on sets, so it's not obvious how to incorporate tf-idf weights without turning it into a different measure, or into something that behaves much like Cosine.
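To illustrate the contrast (a sketch; the token sets are made up): Jaccard compares documents as plain sets of terms, so frequencies, and hence tf-idf weights, are discarded entirely:

```python
def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union| of the two term sets.
    return len(a & b) / len(a | b)

# Documents reduced to term sets: frequencies (and thus tf-idf weights) are lost.
d1 = {"the", "cat", "sat", "on", "mat"}
d2 = {"a", "cat", "lay", "on", "mat"}
print(jaccard(d1, d2))  # 3/7 ~= 0.43, based purely on shared vocabulary
```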