How do I calculate tf-idf
for a query? I understand how to calculate tf-idf for a set of documents with following definitions:
tf = occurances in document/ total words in document
idf = log(#documents / #documents where term occurs
But I don't understand how that correlates to queries.
For example, I read a resource that stated the values of a query "life learning
"
life | tf = .5 | idf = 1.405507153 | tf_idf = 0.702753576
learning | tf = .5 | idf = 1.405507153 | tf_idf = 0.702753576
The tf
values I understand, each term appears only once out of the two possible terms, thus 1/2, But I have no idea where the idf
comes from.
I would think that #documents = 1 and occurrence = 1, log(1) = 0, so idf
would be 0, but this doesn't seem to be the case. Is it based on whatever documents you're using? How do you calculate tf-idf for a query?
Even if this question is marked as answered. I don't feel like it was fully answered. So if maybe anyone will need this in the future:
But I have no idea where the idf comes from.
In this example: Project 3, part 2: Searching using TF-IDF It is presented how to compute the cosine similarity between a query and a set of documents.
As @hypnoticpoisons stated the IDF is a a global component, so the IDF of a word will be the same for each document:
Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.
Assume your query is best car insurance, your total vocabulary contains car, best, auto, insurance and you have N=1,000,000
documents. So your query is something like below:
And one of your document could be:
Now calculate cosine similarity between TF-IDF
of your Query
and Document
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With