
How to use tf-idf with Naive Bayes?

While searching around the question I am posting here, I found many links that propose a solution but do not explain exactly how it should be done. I have explored, for example, the following links:

Link 1

Link 2

Link 3

Link 4

etc.

Therefore, I am presenting my understanding of how the Naive Bayes formula can be used with tf-idf, which is as follows:

Naive Bayes formula:

P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes)

where total_unique_words_in_all_classes is basically the vocabulary of words in the entire training set.

tf-idf weighting can be employed in the above formula as:

word_count_in_class: the sum of the tf-idf weights of the word over all documents belonging to that class (i.e., the raw counts are replaced by the tf-idf weights of the same word, computed for every document within that class).

total_words_in_class: the sum of the tf-idf weights of all words belonging to that class.

total_unique_words_in_all_classes: as is.
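
Concretely, here is a minimal sketch of what I have in mind (all names are my own, and I assume the per-document tf-idf weights have already been computed):

    import math
    from collections import defaultdict

    def train(tfidf_docs_by_class):
        # tfidf_docs_by_class: {class_label: [{word: tfidf_weight}, ...]},
        # one dict per document belonging to that class
        word_weight_in_class = {}    # plays the role of word_count_in_class
        total_weight_in_class = {}   # plays the role of total_words_in_class
        vocabulary = set()
        for cls, docs in tfidf_docs_by_class.items():
            weights = defaultdict(float)
            for doc in docs:
                for word, w in doc.items():
                    weights[word] += w  # sum the word's tf-idf weights over the class
                    vocabulary.add(word)
            word_weight_in_class[cls] = weights
            total_weight_in_class[cls] = sum(weights.values())
        return word_weight_in_class, total_weight_in_class, vocabulary

    def log_p_word_given_class(word, cls, word_weight, total_weight, vocabulary):
        # Laplace-smoothed P(word|class), with tf-idf weights in place of counts
        return math.log((word_weight[cls][word] + 1.0)
                        / (total_weight[cls] + len(vocabulary)))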

This question has been posted multiple times on Stack Overflow, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that comes with built-in functions for both Naive Bayes and tf-idf. What I actually want is to improve the accuracy (currently 30%) of the model, which uses a trained Naive Bayes classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.

Please advise; I am new to this domain.

POOJA GUPTA, asked May 24 '16




3 Answers

It would be better if you actually gave us the exact features and classes you would like to use, or at least gave an example. Since none of those have been given concretely, I'll just assume the following is your problem:

  1. You have a number of documents, each of which has a number of words.
  2. You would like to classify documents into categories.
  3. Your feature vector consists of all possible words across all documents, and its values are the counts of each word in each document.
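
For concreteness, a toy version of that setup could look like this (the documents and labels here are made up):

    from collections import Counter

    docs = ["the cat sat", "the dog barked", "the cat meowed"]
    labels = ["cat", "dog", "cat"]

    vocabulary = sorted({word for doc in docs for word in doc.split()})
    # one row per document, one column per vocabulary word, values are counts
    X = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]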

Your Solution

The tf-idf scheme you gave is the following:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)

Your approach sounds reasonable. The probabilities would still sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.

Another potential solution

It took me a while to wrap my head around this problem. The main reason was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would sidestep this issue entirely.

If you wanted to use this method:

  1. Compute the mean and variance of the tf-idf values for each class.
  2. Compute the class-conditional likelihood using a Gaussian distribution with the above mean and variance.
  3. Proceed as normal (multiply by the prior) and predict values.

Hand-coding this shouldn't be too hard, since the Gaussian density is straightforward to write with numpy. I just prefer this type of generic solution for these types of problems.
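
For illustration, here is a minimal numpy sketch of this approach; it assumes a dense tf-idf matrix X of shape (n_docs, n_features) and a label array y, and the function names are mine:

    import numpy as np

    def fit_gaussian_nb(X, y):
        # per class: prior, and per-feature mean/variance of the tf-idf values
        classes = np.unique(y)
        priors, means, variances = [], [], []
        for c in classes:
            Xc = X[y == c]
            priors.append(len(Xc) / len(X))
            means.append(Xc.mean(axis=0))
            variances.append(Xc.var(axis=0) + 1e-9)  # floor to avoid division by zero
        return classes, np.array(priors), np.array(means), np.array(variances)

    def predict(X, classes, priors, means, variances):
        # log Gaussian likelihood summed over features, plus the log prior, per class
        log_joint = []
        for p, mu, var in zip(priors, means, variances):
            ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            log_joint.append(np.log(p) + ll)
        return classes[np.argmax(np.array(log_joint), axis=0)]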

Additional methods to increase accuracy

Apart from the above, you could also use the following techniques to increase accuracy:

  1. Preprocessing:

    1. Feature reduction (usually NMF, PCA, or LDA)
    2. Additional features
  2. Algorithm:

    Naive Bayes is fast, but it generally performs worse than discriminative algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as an SVM or logistic regression (see the sketch after this list).

  3. Misc.

    Bootstrapping, boosting, etc. Be careful not to overfit though...
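
To illustrate point 2, here is one possible scikit-learn pipeline (a sketch, not the only way to do it): tf-idf features reduced with truncated SVD, feeding a linear SVM. The toy data is made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["good movie", "bad movie", "great film", "terrible film"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(
        TfidfVectorizer(),             # raw text -> tf-idf features
        TruncatedSVD(n_components=2),  # feature reduction (LSA)
        LinearSVC(),                   # discriminative classifier
    )
    clf.fit(docs, labels)
    print(clf.predict(["awesome movie"]))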

Hopefully this was helpful. Leave a comment if anything was unclear.

jrhee17, answered Oct 16 '22


P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes), where total_unique_words_in_all_classes is basically the vocabulary of words in the entire training set.

How would this sum to 1? Using the above conditional probabilities, and summing over the unique words that appear in the class, I assume the sum is

P(word1|class) + P(word2|class) + ... + P(wordn|class) = (total_words_in_class + total_unique_words_in_class) / (total_words_in_class + total_unique_words_in_all_classes)

To correct this, I think P(word|class) should be

(word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_class)

where total_unique_words_in_class is the vocabulary of words in the class.

Please correct me if I am wrong.

alex, answered Oct 16 '22


I think there are two ways to do it:

  1. Round the tf-idf values down to integers, then use the multinomial distribution for the conditional probabilities. See this paper: https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
  2. Use the Dirichlet distribution, which is a continuous version of the multinomial distribution, for the conditional probabilities.

I am not sure whether a Gaussian mixture would be better.
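
For what it's worth, a minimal scikit-learn sketch of option 1 could look like the following. The data and the scaling factor are my own choices; raw tf-idf weights are mostly below 1, so they are scaled up before rounding down to avoid flooring everything to zero:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["good movie", "bad movie", "great film", "terrible film"]
    labels = [1, 0, 1, 0]

    vec = TfidfVectorizer().fit(docs)
    X_int = np.floor(vec.transform(docs).toarray() * 10)  # round down to integers

    clf = MultinomialNB().fit(X_int, labels)
    print(clf.predict(np.floor(vec.transform(["awesome movie"]).toarray() * 10)))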

Guojun Zhang, answered Oct 16 '22