scikits learn and nltk: Naive Bayes classifier performance highly different

I am comparing two Naive Bayes classifiers: one from NLTK and one from scikit-learn. I'm dealing with a multi-class classification problem (3 classes: positive (1), negative (-1), and neutral (0)).

Without performing any feature selection (that is, using all available features), and using a training dataset of 70,000 instances (noisy-labeled, with a class distribution of 17% positive, 4% negative and 78% neutral), I train two classifiers: the first is an nltk.NaiveBayesClassifier, and the second is a sklearn.naive_bayes.MultinomialNB (with fit_prior=True).
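For context, a minimal sketch of the two training setups (the toy feature dicts below are placeholders for my real features, and I vectorize them with DictVectorizer for scikit-learn):

```python
# Sketch of training both classifiers on the same data, assuming each instance
# is a dict of feature counts with a label in {-1, 0, 1}.
import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data: list of (feature_dict, label) pairs
train_feats = [({"good": 2, "movie": 1}, 1), ({"bad": 1}, -1), ({"the": 3}, 0)]

# NLTK trains directly on (featureset, label) pairs
nltk_clf = nltk.NaiveBayesClassifier.train(train_feats)

# scikit-learn needs a numeric matrix, so vectorize the same dicts first
vec = DictVectorizer()
X_train = vec.fit_transform([feats for feats, _ in train_feats])
y_train = [label for _, label in train_feats]

sk_clf = MultinomialNB(fit_prior=True)
sk_clf.fit(X_train, y_train)
```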

After training, I evaluate the classifiers on my test set of 30,000 instances and get the following results:

**NLTK's NaiveBayes**
accuracy: 0.568740
class: 1
     precision: 0.331229
     recall: 0.331565
     F-Measure: 0.331355
class: -1
     precision: 0.079253 
     recall: 0.446331 
     F-Measure: 0.134596 
class: 0
     precision: 0.849842 
     recall: 0.628126 
     F-Measure: 0.722347 


**Scikit's MultinomialNB (with fit_prior=True)**
accuracy: 0.834670
class: 1
     precision: 0.400247
     recall: 0.125359
     F-Measure: 0.190917
class: -1
     precision: 0.330836
     recall: 0.012441
     F-Measure: 0.023939
class: 0
     precision: 0.852997
     recall: 0.973406
     F-Measure: 0.909191

**Scikit's MultinomialNB (with fit_prior=False)**
accuracy: 0.834680
class: 1
     precision: 0.400380
     recall: 0.125361
     F-Measure: 0.190934
class: -1
     precision: 0.330836
     recall: 0.012441
     F-Measure: 0.023939
class: 0
     precision: 0.852998
     recall: 0.973418
     F-Measure: 0.909197
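
For reference, a minimal sketch of how the per-class metrics above can be computed with sklearn.metrics (y_true and y_pred are placeholders for the real test labels and predictions):

```python
# Computing accuracy and per-class precision/recall/F-measure for labels {1, -1, 0}.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, -1, 0, 0, 1]   # placeholder gold labels
y_pred = [1, -1, 0, 0, 0]   # placeholder classifier output

print("accuracy: %f" % accuracy_score(y_true, y_pred))

prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, -1, 0])
for label, p, r, f in zip([1, -1, 0], prec, rec, f1):
    print("class: %d" % label)
    print("     precision: %f" % p)
    print("     recall: %f" % r)
    print("     F-Measure: %f" % f)
```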

I have noticed that while scikit-learn's classifier has better overall accuracy and precision, its recall is very low compared to NLTK's, at least with my data. Considering that they are supposed to be (almost) the same classifier, isn't this strange?

Asked May 02 '12 by D T


1 Answer

Is the default behavior for class weights the same in both libraries? The difference in precision on the rare class (-1) suggests that might be the cause...
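
As a quick check, something like this sketch would show which priors MultinomialNB actually learned (it assumes the vectorized X_train / y_train from the question):

```python
# Inspect the class priors MultinomialNB learned from the skewed training data,
# then force uniform priors for comparison.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(fit_prior=True).fit(X_train, y_train)
print(clf.classes_)                  # label order, e.g. [-1  0  1]
print(np.exp(clf.class_log_prior_))  # empirical priors, roughly [0.04, 0.78, 0.17]

# Override the learned priors with uniform ones
uniform = MultinomialNB(class_prior=[1.0 / 3] * 3).fit(X_train, y_train)
print(np.exp(uniform.class_log_prior_))
```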

Answered Oct 19 '22 by Marc Shivers