
How to use tf-idf with Naive Bayes?

While searching around the question I am posting here, I found many links that propose a solution but do not explain exactly how it should be done. I have explored, for example, the following links:

Link 1

Link 2

Link 3

Link 4

etc.

Therefore, I am presenting my understanding of how the Naive Bayes formula can be used with tf-idf, which is as follows:

Naive Bayes formula:

P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes)

where total_unique_words_in_all_classes is basically the vocabulary of words in the entire training set.

tf-idf weighting can be employed in the above formula as:

word_count_in_class: the sum of the tf-idf weights of the word over all documents belonging to that class (i.e., the raw counts are replaced by the tf-idf weights of the same word, computed for every document within that class).

total_words_in_class: the sum of the tf-idf weights of all words belonging to that class.

total_unique_words_in_all_classes: as is.
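
Concretely, here is a minimal sketch of what I have in mind (all names are my own, and I assume the per-document tf-idf weights have already been computed):

    import math
    from collections import defaultdict

    def train(tfidf_docs_by_class):
        # tfidf_docs_by_class: {class_label: [{word: tfidf_weight}, ...]},
        # one dict per document belonging to that class
        word_weight_in_class = {}    # plays the role of word_count_in_class
        total_weight_in_class = {}   # plays the role of total_words_in_class
        vocabulary = set()
        for cls, docs in tfidf_docs_by_class.items():
            weights = defaultdict(float)
            for doc in docs:
                for word, w in doc.items():
                    weights[word] += w  # sum the word's tf-idf weights over the class
                    vocabulary.add(word)
            word_weight_in_class[cls] = weights
            total_weight_in_class[cls] = sum(weights.values())
        return word_weight_in_class, total_weight_in_class, vocabulary

    def log_p_word_given_class(word, cls, word_weight, total_weight, vocabulary):
        # Laplace-smoothed P(word|class), with tf-idf weights in place of counts
        return math.log((word_weight[cls][word] + 1.0)
                        / (total_weight[cls] + len(vocabulary)))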

This question has been posted multiple times on Stack Overflow, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that comes with built-in functions for both Naive Bayes and tf-idf. What I actually want is to improve the accuracy (currently 30%) of the model, which uses a trained Naive Bayes classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.

Please advise; I am new to this domain.

POOJA GUPTA, asked May 24 '16




3 Answers

It would be better if you actually gave us the exact features and classes you would like to use, or at least gave an example. Since none of those have been given concretely, I'll just assume the following is your problem:

  1. You have a number of documents, each of which has a number of words.
  2. You would like to classify documents into categories.
  3. Your feature vector consists of all possible words across all documents, and its values are the counts of each word in each document.
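
For concreteness, a toy version of that setup could look like this (the documents and labels here are made up):

    from collections import Counter

    docs = ["the cat sat", "the dog barked", "the cat meowed"]
    labels = ["cat", "dog", "cat"]

    vocabulary = sorted({word for doc in docs for word in doc.split()})
    # one row per document, one column per vocabulary word, values are counts
    X = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]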

Your Solution

The tf-idf scheme you gave is the following:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)

Your approach sounds reasonable. The probabilities would still sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.

Another potential solution

It took me a while to wrap my head around this problem. The main reason was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would sidestep this issue entirely.

If you wanted to use this method:

  1. Compute the mean and variance of the tf-idf values for each class.
  2. Compute the class-conditional likelihood using a Gaussian distribution with the above mean and variance.
  3. Proceed as normal (multiply by the prior) and predict values.

Hand-coding this shouldn't be too hard, since the Gaussian density is straightforward to write with numpy. I just prefer this type of generic solution for these types of problems.
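
For illustration, here is a minimal numpy sketch of this approach; it assumes a dense tf-idf matrix X of shape (n_docs, n_features) and a label array y, and the function names are mine:

    import numpy as np

    def fit_gaussian_nb(X, y):
        # per class: prior, and per-feature mean/variance of the tf-idf values
        classes = np.unique(y)
        priors, means, variances = [], [], []
        for c in classes:
            Xc = X[y == c]
            priors.append(len(Xc) / len(X))
            means.append(Xc.mean(axis=0))
            variances.append(Xc.var(axis=0) + 1e-9)  # floor to avoid division by zero
        return classes, np.array(priors), np.array(means), np.array(variances)

    def predict(X, classes, priors, means, variances):
        # log Gaussian likelihood summed over features, plus the log prior, per class
        log_joint = []
        for p, mu, var in zip(priors, means, variances):
            ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            log_joint.append(np.log(p) + ll)
        return classes[np.argmax(np.array(log_joint), axis=0)]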

Additional methods to increase accuracy

Apart from the above, you could also use the following techniques to increase accuracy:

  1. Preprocessing:

    1. Feature reduction (usually NMF, PCA, or LDA)
    2. Additional features
  2. Algorithm:

    Naive Bayes is fast, but it generally performs worse than discriminative algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as an SVM or logistic regression (see the sketch after this list).

  3. Misc.

    Bootstrapping, boosting, etc. Be careful not to overfit though...
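
To illustrate point 2, here is one possible scikit-learn pipeline (a sketch, not the only way to do it): tf-idf features reduced with truncated SVD, feeding a linear SVM. The toy data is made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["good movie", "bad movie", "great film", "terrible film"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(
        TfidfVectorizer(),             # raw text -> tf-idf features
        TruncatedSVD(n_components=2),  # feature reduction (LSA)
        LinearSVC(),                   # discriminative classifier
    )
    clf.fit(docs, labels)
    print(clf.predict(["awesome movie"]))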

Hopefully this was helpful. Leave a comment if anything was unclear.

jrhee17, answered Oct 16 '22


P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes), where total_unique_words_in_all_classes is basically the vocabulary of words in the entire training set.

How would this sum to 1? Using the above conditional probabilities, and summing over the unique words that appear in the class, I assume the sum is

P(word1|class) + P(word2|class) + ... + P(wordn|class) = (total_words_in_class + total_unique_words_in_class) / (total_words_in_class + total_unique_words_in_all_classes)

To correct this, I think P(word|class) should be

(word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_class)

where total_unique_words_in_class is the vocabulary of words in the class.

Please correct me if I am wrong.

alex, answered Oct 16 '22


I think there are two ways to do it:

  1. Round the tf-idf values down to integers, then use the multinomial distribution for the conditional probabilities. See this paper: https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
  2. Use the Dirichlet distribution, which is a continuous version of the multinomial distribution, for the conditional probabilities.

I am not sure whether a Gaussian mixture would be better.
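
For what it's worth, a minimal scikit-learn sketch of option 1 could look like the following. The data and the scaling factor are my own choices; raw tf-idf weights are mostly below 1, so they are scaled up before rounding down to avoid flooring everything to zero:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["good movie", "bad movie", "great film", "terrible film"]
    labels = [1, 0, 1, 0]

    vec = TfidfVectorizer().fit(docs)
    X_int = np.floor(vec.transform(docs).toarray() * 10)  # round down to integers

    clf = MultinomialNB().fit(X_int, labels)
    print(clf.predict(np.floor(vec.transform(["awesome movie"]).toarray() * 10)))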

Guojun Zhang, answered Oct 16 '22