scikits learn and nltk: Naive Bayes classifier performance highly different

I am comparing two Naive Bayes classifiers: one from NLTK and one from scikit-learn. I'm dealing with a multi-class classification problem (3 classes: positive (1), negative (-1), and neutral (0)).

Without performing any feature selection (that is, using all available features), I train two classifiers on a training set of 70,000 noisily labeled instances (class distribution: 17% positive, 4% negative, and 78% neutral). The first is an nltk.NaiveBayesClassifier, and the second is a sklearn.naive_bayes.MultinomialNB (with fit_prior=True).

After training, I evaluated both classifiers on my test set of 30,000 instances and got the following results:

**NLTK's NaiveBayes**
accuracy: 0.568740
class: 1
     precision: 0.331229
     recall: 0.331565
     F-Measure: 0.331355
class: -1
     precision: 0.079253 
     recall: 0.446331 
     F-Measure: 0.134596 
class: 0
     precision: 0.849842 
     recall: 0.628126 
     F-Measure: 0.722347 


**Scikit's MultinomialNB (with fit_prior=True)**
accuracy: 0.834670
class: 1
     precision: 0.400247
     recall: 0.125359
     F-Measure: 0.190917
class: -1
     precision: 0.330836
     recall: 0.012441
     F-Measure: 0.023939
class: 0
     precision: 0.852997
     recall: 0.973406
     F-Measure: 0.909191

**Scikit's MultinomialNB (with fit_prior=False)**
accuracy: 0.834680
class: 1
     precision: 0.400380
     recall: 0.125361
     F-Measure: 0.190934
class: -1
     precision: 0.330836
     recall: 0.012441
     F-Measure: 0.023939
class: 0
     precision: 0.852998
     recall: 0.973418
     F-Measure: 0.909197

I have noticed that while scikit-learn's classifier has better overall accuracy and precision, its recall on the positive (1) and negative (-1) classes is very low compared to NLTK's, at least with my data. Given that the two are presumably (almost) the same classifier, isn't this strange?
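For reference, here is a minimal pure-Python sketch of how accuracy and the per-class precision, recall, and F-measure relate under class imbalance. The 100 labels are made up (78 neutral, 17 positive, 5 negative, only roughly mirroring the class mix above), and `per_class_metrics` is a hypothetical helper, not a function from NLTK or scikit-learn. It shows that a predictor that always answers "neutral" already reaches high accuracy with zero recall on the minority classes, which is the direction in which the scikit-learn numbers lean.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return (accuracy, {label: (precision, recall, f_measure)})."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # Guard against empty denominators (class never predicted / never present).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        report[label] = (precision, recall, f_measure)
    return accuracy, report


# Made-up gold labels: 78 neutral (0), 17 positive (1), 5 negative (-1).
y_true = [0] * 78 + [1] * 17 + [-1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy, report = per_class_metrics(y_true, y_pred, labels=[1, -1, 0])
print("accuracy:", accuracy)
for label, (p, r, f) in report.items():
    print("class:", label, "precision:", p, "recall:", r, "F-Measure:", f)
```

Because accuracy is dominated by the neutral class, comparing the two classifiers on per-class recall (or a macro-averaged F-measure) is likely more informative than comparing accuracy alone.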

asked May 02 '12 03:05

D T

1 Answer

Is the default behavior for class weights the same in both libraries? The difference in precision for the rare class (-1) suggests that this might be the cause.
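On the scikit-learn side, one way to see which class priors a fitted model actually uses is to inspect its `class_log_prior_` attribute. The snippet below is only a sketch on made-up toy data (10 documents, 3 count features), not the asker's dataset; it compares `fit_prior=True` with `fit_prior=False`. The NLTK side would need a separate check of how its classifier estimates label probabilities.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: 10 "documents" with 3 non-negative count features each.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(10, 3))
# Imbalanced labels: 7 neutral (0), 2 positive (1), 1 negative (-1).
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, -1])

with_prior = MultinomialNB(fit_prior=True).fit(X, y)
without_prior = MultinomialNB(fit_prior=False).fit(X, y)

# classes_ is sorted, so the priors are reported for -1, 0, 1 in that order.
learned_priors = dict(zip(with_prior.classes_.tolist(),
                          np.exp(with_prior.class_log_prior_).tolist()))
uniform_priors = dict(zip(without_prior.classes_.tolist(),
                          np.exp(without_prior.class_log_prior_).tolist()))

print("fit_prior=True :", learned_priors)   # empirical class frequencies
print("fit_prior=False:", uniform_priors)   # uniform over the classes
```

If the two settings give nearly identical predictions, as in the results above, the likelihood terms are probably dominating the prior, so the gap between the libraries may come from how features are modeled rather than from the priors alone. That is a hypothesis to test, not a confirmed diagnosis.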

answered Oct 19 '22 09:10

Marc Shivers

Tags: python, machine-learning, nltk, scikit-learn, scikits
