 

interpreting Naive Bayes results

I have started using the NaiveBayes/NaiveBayesSimple classifier in Weka for classification, but I have trouble understanding the results when training on the data. The data set I'm using is weather.nominal.arff.


When I evaluate using the "Use training set" option, the classifier result is:

Correctly Classified Instances          13               92.8571 %
Incorrectly Classified Instances         1                7.1429 %

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no

My first question: what should I understand from the incorrectly classified instance? Why did such a misclassification occur? Which attribute combination was classified incorrectly, and is there a way to find out?

Secondly, when I try 10-fold cross-validation, why do I get fewer correctly classified instances?

The results are:

Correctly Classified Instances           8               57.1429 %
Incorrectly Classified Instances         6               42.8571 %

 a b   <-- classified as
 7 2 | a = yes
 4 1 | b = no
asked Sep 06 '10 by berkay


1 Answer

You can get the individual predictions for each instance by choosing this option from:

More Options... > Output predictions > PlainText

This will give you, in addition to the evaluation metrics, the following:

=== Predictions on training set ===

 inst#     actual  predicted error prediction
     1       2:no       2:no       0.704 
     2       2:no       2:no       0.847 
     3      1:yes      1:yes       0.737 
     4      1:yes      1:yes       0.554 
     5      1:yes      1:yes       0.867 
     6       2:no      1:yes   +   0.737 
     7      1:yes      1:yes       0.913 
     8       2:no       2:no       0.588 
     9      1:yes      1:yes       0.786 
    10      1:yes      1:yes       0.845 
    11      1:yes      1:yes       0.568 
    12      1:yes      1:yes       0.667 
    13      1:yes      1:yes       0.925 
    14       2:no       2:no       0.652 

which indicates that the 6th instance was misclassified. Note that even if you train and test on the same instances, misclassifications can occur due to inconsistencies in the data (the simplest example is having two instances with the same features but different class labels).
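If you prefer to do this outside the Explorer, the same per-instance predictions can be produced through the Weka Java API. The following is only a minimal sketch, not the code behind the GUI option: it assumes weka.jar is on the classpath and weather.nominal.arff sits in the working directory, and it uses the standard NaiveBayes class (swap in NaiveBayesSimple if you want to match the classifier from the question):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainingSetPredictions {
    public static void main(String[] args) throws Exception {
        // Load the data and mark the last attribute (play) as the class.
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train on the full data set, mirroring "Use training set".
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Print actual vs. predicted class for every training instance,
        // flagging errors with "+" as in the Explorer output above.
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            double actual = inst.classValue();
            double predicted = nb.classifyInstance(inst);
            double confidence = nb.distributionForInstance(inst)[(int) predicted];
            System.out.printf("%5d  %s  %s  %s  %.3f%n",
                    i + 1,
                    data.classAttribute().value((int) actual),
                    data.classAttribute().value((int) predicted),
                    actual != predicted ? "+" : " ",
                    confidence);
        }
    }
}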

Keep in mind that the above way of testing is biased (it's somewhat cheating, since the classifier has already seen the answers to the questions). Thus we are usually interested in a more realistic estimate of the model's error on unseen data. Cross-validation is one such technique: it partitions the data into 10 stratified folds, tests on one fold while training on the other nine, and finally reports the average accuracy across the ten runs.
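The same procedure can be run programmatically. Here is a minimal sketch, again assuming weka.jar on the classpath and a local weather.nominal.arff, that uses Weka's Evaluation class to perform stratified 10-fold cross-validation and print the summary statistics and confusion matrix:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationEstimate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluation handles the stratified fold splitting, training and
        // testing internally; the Random seed controls how folds are drawn.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());  // correctly/incorrectly classified
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}

Note that the reported accuracy depends on the random seed, since it changes which instances land in which fold; on a data set as small as weather.nominal (14 instances) that variation can be substantial.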

answered Oct 12 '22 by Amro