 

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity.

I was reading up on both, and then on the Wikipedia page for Cosine Similarity I found this sentence: "In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°."

Now I'm wondering... aren't they two different things?

Is tf-idf already inside the cosine similarity? If yes, then what the heck - all I can see are dot products and Euclidean lengths.

I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?

asked Jun 06 '11 by N00programmer


People also ask

Does cosine similarity use TF-IDF?

TF-IDF will give you a representation for a given term in a document. Cosine similarity will give you a score for two different documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term".

What is difference between TF and IDF?

To summarize: the key intuition motivating TF-IDF is that the importance of a term is inversely related to its frequency across documents. TF tells us how often a term appears in a document, and IDF tells us how rare the term is across the collection of documents.

What is better than TF-IDF?

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data.

Which similarity metric is usually considered suitable for measuring the similarities between TF-IDF vectors?

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another, irrespective of their size. The TF-IDF technique helps convert documents into vectors, where each value in the vector corresponds to the TF-IDF score of a word in the document.
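
As a rough illustration of that pipeline, here is a minimal sketch using scikit-learn's TfidfVectorizer and cosine_similarity (the toy documents are made up, purely for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Made-up toy documents, purely for illustration.
    docs = [
        "the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply today",
    ]

    # Each document becomes a vector of tf-idf scores, one per vocabulary word.
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Pairwise cosine similarities; all values land in [0, 1] because
    # tf-idf weights are non-negative.
    print(cosine_similarity(tfidf).round(2))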


2 Answers

Tf-idf is a transformation you apply to texts to get real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing it by the product of their norms. That yields the cosine of the angle between the vectors.

If d2 and q are tf-idf vectors, then

    sim(d2, q) = (d2 · q) / (|d2| |q|) = cos θ

where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.
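
A quick sketch of that formula in plain NumPy (the two tf-idf vectors below are made up, just to show the arithmetic):

    import numpy as np

    # Hypothetical tf-idf vectors for a document d2 and a query q.
    d2 = np.array([0.0, 1.2, 0.4, 3.1])
    q  = np.array([0.5, 1.0, 0.0, 2.8])

    # cos(theta) = (d2 . q) / (|d2| |q|)
    cos_theta = np.dot(d2, q) / (np.linalg.norm(d2) * np.linalg.norm(q))
    print(cos_theta)  # falls in [0, 1] because the components are non-negative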

There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.

(Formula taken from Wikipedia, hence the d2.)
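
To make the point about swapping transformations concrete, here is a small sketch (made-up documents) that plugs two different weightings, raw counts and tf-idf, into the same cosine-similarity machinery via scikit-learn:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "the cat sat on the hat",
        "stock prices fell sharply today",
    ]

    # Same vector space model, two different term-weighting transformations.
    for vectorizer in (CountVectorizer(), TfidfVectorizer()):
        X = vectorizer.fit_transform(docs)    # document-term matrix
        print(type(vectorizer).__name__)
        print(cosine_similarity(X).round(2))  # pairwise similarities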

answered Sep 28 '22 by Fred Foo


TF-IDF is a way to measure the importance of tokens in text; it's a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle whose cosine you're computing).

To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to document similarity in general.

You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered stopwords cause the same sort of issue. On Stack Overflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
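
For instance, a raw term-frequency vector is nothing more than a word count per document; a tiny sketch (with made-up sentences) of why the common words swamp everything else:

    from collections import Counter

    doc_a = "the recipe is simple and the egg is beaten well".split()
    doc_b = "the question is long and the answer is short".split()

    # Raw term frequencies: "the" and "is" dominate both documents,
    # so the two look alike even though their topics have nothing in common.
    print(Counter(doc_a).most_common(3))
    print(Counter(doc_b).most_common(3))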

TF-IDF adjusts the raw term frequency by taking into account how frequently each term occurs across the collection as a whole (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (formula from Wikipedia):

    idf(term) = log( total number of documents / number of documents containing the term )

Think of the 'log' as a minor nuance that helps things work out in the long run: it grows as its argument grows, so if the term is rare the IDF will be high (lots of documents divided by very few documents), and if the term is common the IDF will be low (lots of documents divided by lots of documents ≈ 1, and log(1) is 0).
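
A one-function sketch of that definition (using the natural log, which also matches the worked numbers below; the corpus sizes are hypothetical):

    import math

    def idf(total_docs, docs_containing_term):
        # log(number of documents / number of documents the term occurs in)
        return math.log(total_docs / docs_containing_term)

    print(idf(100, 99))  # common term -> IDF near 0
    print(idf(100, 9))   # rare term   -> IDF around 2.4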

Say you have 100 recipes, and all but one require eggs. Now you get three more documents that all contain the word "egg": once in the first document, twice in the second, and once in the third. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102 if you count the new documents; let's stick with 99).

The TF-IDF of 'egg' is:

    1 * log(100/99) = 0.01    # document 1
    2 * log(100/99) = 0.02    # document 2
    1 * log(100/99) = 0.01    # document 3

These are all pretty small numbers; in contrast, let's look at another word that occurs in only 9 documents of your 100-recipe corpus: 'arugula'. It occurs once in the first doc, twice in the second, and not at all in the third.

The TF-IDF for 'arugula' is:

    1 * log(100/9) = 2.40     # document 1
    2 * log(100/9) = 4.81     # document 2
    0 * log(100/9) = 0        # document 3
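
Those two sets of scores can be reproduced in a couple of lines (natural log assumed; expect small rounding differences from the figures above):

    import math

    def tfidf(tf, total_docs, docs_with_term):
        return tf * math.log(total_docs / docs_with_term)

    # 'egg': term frequencies 1, 2, 1; appears in 99 of 100 documents.
    print([round(tfidf(tf, 100, 99), 2) for tf in (1, 2, 1)])
    # 'arugula': term frequencies 1, 2, 0; appears in 9 of 100 documents.
    print([round(tfidf(tf, 100, 9), 2) for tf in (1, 2, 0)])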

'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
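
To see the effect on the similarity itself, here is a sketch that treats each document as a two-term vector ('egg', 'arugula'), once with the tf-idf weights above and once with the raw counts:

    import numpy as np

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Vector layout: ['egg', 'arugula'], tf-idf weights from the example above.
    d1, d2, d3 = np.array([0.01, 2.40]), np.array([0.02, 4.81]), np.array([0.01, 0.0])
    print(cos(d1, d2), cos(d1, d3))   # docs 1 & 2 almost identical, doc 3 far away

    # Raw term counts for the same three documents.
    c1, c2, c3 = np.array([1, 1]), np.array([2, 2]), np.array([1, 0])
    print(cos(c1, c2), cos(c1, c3))   # doc 3 still differs, but by a smaller margin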

The take-home here is that TF-IDF gives a more useful measure of a term's weight in a document, so you don't fixate on really common terms (stopwords, 'egg') and lose sight of the important ones ('arugula').

answered Sep 28 '22 by rcreswick