
Simple example using BernoulliNB (naive Bayes classifier) in scikit-learn with Python - cannot explain classification

Using scikit-learn 0.10

Why does the following trivial code snippet:

import numpy as np
import sklearn
from sklearn.naive_bayes import BernoulliNB

print sklearn.__version__

# Two training samples: all ones labelled 1, all zeros labelled 2
X = np.array([ [1, 1, 1, 1, 1],
               [0, 0, 0, 0, 0] ])
print "X: ", X
Y = np.array([ 1, 2 ])
print "Y: ", Y

clf = BernoulliNB()
clf.fit(X, Y)
print "Prediction:", clf.predict( [0, 0, 0, 0, 0] )

print an answer of "1"? Having trained the model on [0, 0, 0, 0, 0] => 2, I was expecting "2" as the answer.

And why does replacing Y with

Y = np.array([ 3, 2 ])

give a different answer, class "2" (the correct one)? Isn't this just a class label?

Can someone shed some light on this?

asked Aug 04 '12 by MalteseUnderdog

2 Answers

By default, the smoothing parameter alpha is one. As msw said, your training set is very small, and with that much smoothing no information is left: both classes come out equiprobable, and the tie is then broken in favor of the lowest class label. That is why Y = [1, 2] yields "1" while Y = [3, 2] happens to yield the correct "2". If you set alpha to a very small value, you should see the result you expected.
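A minimal sketch of that suggestion, reusing the arrays from the question (alpha=1e-10 here is only for illustration, not a recommended setting):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([ [1, 1, 1, 1, 1],
               [0, 0, 0, 0, 0] ])
Y = np.array([ 1, 2 ])

# Near-zero smoothing instead of the default alpha=1.0, so the two
# training examples dominate the estimated feature probabilities.
clf = BernoulliNB(alpha=1e-10)
clf.fit(X, Y)
print "Prediction:", clf.predict([[0, 0, 0, 0, 0]])   # prints [2], as expected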

answered by Andreas Mueller


Your training set is too small, as can be shown by

clf.predict_proba(X)

which yields

array([[ 0.5,  0.5],
       [ 0.5,  0.5]])

which shows that the classifier views all classes as equiprobable. Compare with the example shown in the documentation for BernoulliNB, for which predict_proba() yields:

array([[ 2.71828146,  1.00000008,  1.00000004,  1.00000002,  1.        ],
       [ 1.00000006,  2.7182802 ,  1.00000004,  1.00000042,  1.00000007],
       [ 1.00000003,  1.00000005,  2.71828149,  1.        ,  1.00000003],
       [ 1.00000371,  1.00000794,  1.00000008,  2.71824811,  1.00000068],
       [ 1.00000007,  1.0000028 ,  1.00000149,  2.71822455,  1.00001671],
       [ 1.        ,  1.00000007,  1.00000003,  1.00000027,  2.71828083]])

where I applied numpy.exp() to the results to make them more readable. Obviously, the probabilities are not even close to equal, and in fact they classify the training set well.
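For reference, a sketch along the lines of that documentation example (six random binary samples, one class label each; the exact numbers vary from run to run):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Six random 100-dimensional binary samples, each with its own class label.
X = np.random.randint(2, size=(6, 100))
Y = np.array([1, 2, 3, 4, 5, 6])

clf = BernoulliNB()
clf.fit(X, Y)
print np.exp(clf.predict_proba(X))   # exp() applied for readability, as above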

answered by msw