
Calculating IDF (as in TF-IDF) when testing?

As I understand it, IDF reflects how many documents contain a term (roughly speaking). On the training set you can calculate IDF (along with TF) because you have all the documents beforehand. But what if I don't have the test set in advance and test documents arrive sequentially (for example, from a web crawler)? How do I then calculate the IDF for the words in a test document?

samsamara asked Apr 11 '12 14:04



1 Answer

In this case, if your dataset is big enough, you can compute IDF from the training set alone. In the test phase, if a term also appears in the training set, use its IDF from training; if the term is new, calculate its IDF from the number of training documents (its document frequency in the training set is zero). For some purposes, smoothing methods give better results.
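
As a rough illustration of this approach, here is a minimal Python sketch. The function names (`fit_idf`, `idf`, `tfidf`), the toy documents, and the particular smoothing formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for the example, not something prescribed above; other smoothing variants would work as well.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Collect document frequencies from the training set only.

    Returns (df, n_docs): a Counter mapping term -> number of training
    documents containing it, and the number of training documents.
    """
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))  # count each term at most once per document
    return df, len(train_docs)

def idf(term, df, n_docs):
    """Smoothed IDF (assumed formula): log((1 + N) / (1 + df)) + 1.

    A term never seen in training has df = 0, so its IDF depends only
    on the number of training documents N.
    """
    return math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0

def tfidf(doc, df, n_docs):
    """TF-IDF weights for one tokenized document, using training statistics."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf(t, df, n_docs) for t, c in tf.items()}

# Hypothetical toy data: each document is a list of tokens.
train_docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
df, n_docs = fit_idf(train_docs)

# A test document arriving later; "zebra" was never seen in training.
test_doc = ["cat", "zebra", "zebra"]
weights = tfidf(test_doc, df, n_docs)
```

The training statistics (`df`, `n_docs`) stay fixed, so each incoming test document can be weighted on its own as it arrives. With this smoothing, unseen terms get the largest IDF available, which may or may not be what you want for your task.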

MRFS answered Sep 30 '22 01:09
