 

In general, when does TF-IDF reduce accuracy?

I'm classifying a corpus of 200,000 reviews as positive or negative using a Naive Bayes model, and I noticed that applying TF-IDF actually reduced the accuracy (measured on a test set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF makes any underlying assumptions about the data or the model it is used with, i.e. are there cases where using it reduces accuracy?

Train Heartnet asked Jun 08 '26 05:06



1 Answer

The IDF component of TF-IDF can harm your classification accuracy in some cases.

Let's suppose the following artificial, easy classification task, made up for the sake of illustration:

  • Class A: texts containing the word 'corn'
  • Class B: texts not containing the word 'corn'

Suppose now that Class A has 100,000 examples and Class B has 1,000 examples.

What will happen with TF-IDF? The inverse document frequency of 'corn' will be very low (because it is found in almost all documents), so the feature 'corn' will get a very small TF-IDF value, which is the feature weight used by the classifier. Yet 'corn' is clearly THE best feature for this classification task. This is an example where TF-IDF may reduce your classification accuracy. In more general terms:
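To make the numbers concrete, here is a minimal sketch of the IDF calculation for this example. It assumes the plain textbook definition idf = ln(N / df); libraries often use smoothed variants, so exact values will differ. The function name `idf` and the rare word appearing in 10 documents are hypothetical, added only for comparison.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Unsmoothed inverse document frequency: ln(N / df)."""
    return math.log(n_docs / doc_freq)

# Corpus sizes from the example above.
n_class_a = 100_000   # every Class A text contains 'corn'
n_class_b = 1_000     # no Class B text contains 'corn'
n_docs = n_class_a + n_class_b

# 'corn' occurs in all Class A documents and in no Class B document.
idf_corn = idf(n_docs, n_class_a)

# Hypothetical rare word found in only 10 documents, for comparison.
idf_rare = idf(n_docs, 10)

print(f"idf('corn')    = {idf_corn:.4f}")
print(f"idf(rare word) = {idf_rare:.4f}")
```

Under this definition 'corn' gets an IDF of roughly 0.01, while the rare word gets roughly 9.2, so the perfectly discriminative feature ends up weighted several hundred times lower than an arbitrary rare one.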

  • when there is class imbalance: if one class has more instances, the good word features of the frequent class risk having a lower IDF, so that class's best features get a lower weight
  • when you have high-frequency words that are very predictive of one of the classes (words found in most documents of that class)
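The class imbalance point can be sketched the same way. The snippet below assumes the unsmoothed definition idf = ln(N / df) and a hypothetical marker word that appears in every document of the frequent class and in none of the other class; `marker_idf` and the chosen class sizes are illustrative, not taken from a real dataset.

```python
import math

def marker_idf(n_frequent: int, n_other: int) -> float:
    """IDF of a word present in all documents of the frequent class only."""
    n_docs = n_frequent + n_other
    return math.log(n_docs / n_frequent)

# (frequent class size, other class size): balanced, 10:1, 100:1
class_sizes = [(1_000, 1_000), (10_000, 1_000), (100_000, 1_000)]
values = [marker_idf(a, b) for a, b in class_sizes]

for (a, b), value in zip(class_sizes, values):
    print(f"{a:>7} vs {b:>5} documents -> idf = {value:.4f}")
```

The more the frequent class dominates, the closer the marker word's IDF gets to zero, even though the word separates the two classes equally well in every case.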
Pascal Soucy answered Jun 10 '26 10:06



