
How to select training data for naive bayes classifier

I want to double-check some concepts I am unsure about regarding the training set for classifier learning. When we select records for our training data, do we select an equal number of records per class, summing to N, or do we randomly pick N records regardless of class?

Intuitively I was thinking of the former, but then the prior class probabilities would all be equal and not really helpful?

goh asked Jul 05 '11 08:07



1 Answer

It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand. You can ask the following questions:

  • Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
  • Is there a large difference in the prior probabilities of each class?

If so, you should probably redistribute the classes.
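The second question can be checked directly from the labels. Here is a minimal sketch, assuming the labels are available as a plain list; the `class_priors` helper and the spam/ham counts are made up for illustration. It estimates each prior as the class's share of the sample, which also shows why an equal-per-class sample makes every prior identical.

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(class) as the fraction of records carrying that label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical samples of N = 100 records each.
random_sample = ["spam"] * 20 + ["ham"] * 80    # drawn regardless of class
balanced_sample = ["spam"] * 50 + ["ham"] * 50  # equal number per class

random_priors = class_priors(random_sample)      # priors follow the data
balanced_priors = class_priors(balanced_sample)  # priors are all equal
```

With the balanced sample, the priors carry no information and the classifier relies on the per-class feature likelihoods alone.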

In my experience, there is no harm in redistributing the classes, but it's not always necessary.

In the case of fraud or intrusion detection, the class you want to predict can make up less than 1% of the records. In this case you must distribute the classes evenly in the training set if you want the classifier to learn the differences between the classes. Otherwise, it can produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, and identifying fraud is the whole point of creating the classifier to begin with.
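To make the fraud example concrete, here is a small sketch with made-up counts (995 legitimate records and 5 fraud records). The "classifier" is a hypothetical baseline that always predicts the most frequent label. It is not a naive Bayes model, only the degenerate behaviour an imbalanced training set can push a learner towards.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common_label

def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of truly positive records that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical data set: 0.5% of the records are fraud.
labels = ["legit"] * 995 + ["fraud"] * 5
predict = majority_baseline(labels)
predictions = [predict(None) for _ in labels]

baseline_accuracy = accuracy(labels, predictions)    # 0.995
fraud_recall = recall(labels, predictions, "fraud")  # 0.0
```

Accuracy looks excellent while recall on the fraud class is zero, which is why accuracy alone is a poor measure on data like this.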

Once you have a set of evenly distributed classes you can use any technique, such as k-fold cross-validation, to train and validate the classifier.
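One possible way to get there, sketched under assumptions: random undersampling of the larger classes down to the size of the smallest one, followed by a plain k-fold split. Undersampling is only one of several balancing options, and the helper names `undersample_to_balance` and `k_fold_indices` are made up for this example.

```python
import random
from collections import defaultdict

def undersample_to_balance(records, labels, seed=0):
    """Randomly keep the same number of records per class (the smallest class size)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for record in rng.sample(group, smallest):
            balanced.append((record, label))
    rng.shuffle(balanced)  # avoid folds that contain a single class
    return balanced

def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

# Hypothetical data: record ids 0..999, the last 5 of them fraud.
records = list(range(1000))
labels = ["legit"] * 995 + ["fraud"] * 5
balanced = undersample_to_balance(records, labels)
folds = list(k_fold_indices(len(balanced), 5))
```

Keep in mind that folds drawn from the balanced set no longer reflect the real class ratio, so it is worth checking the final model against data with the original distribution as well.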

Another example where class distributions need to be adjusted, though not necessarily to an equal number of records per class, is recognizing upper-case letters of the alphabet from their shapes.

If you take a distribution of letters as commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Qs and Os, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Qs) so it can determine that Q and O are indeed different letters.
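A rough sketch of topping up a rare class, assuming the training set is a list of `(features, label)` pairs. The helper `oversample_to_minimum` and the counts are hypothetical. It duplicates existing rare-class samples at random, which is only a stand-in: duplicates add weight but no new shapes, so collecting more real Qs is preferable when that is possible.

```python
import random
from collections import Counter

def oversample_to_minimum(samples, min_per_class, seed=0):
    """Duplicate rare-class samples at random until each class has at least min_per_class."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in samples:
        by_class.setdefault(label, []).append((features, label))
    result = list(samples)
    for label, group in by_class.items():
        shortfall = min_per_class - len(group)
        if shortfall > 0:
            # Sampling with replacement, since the class may have very few members.
            result.extend(rng.choices(group, k=shortfall))
    return result

# Hypothetical letter data: feature vectors are placeholders, not real shape features.
letters = [((0.9, 0.1), "O")] * 200 + [((0.8, 0.3), "Q")] * 2
boosted = oversample_to_minimum(letters, min_per_class=50)
counts = Counter(label for _, label in boosted)
```

Here the O class is left at 200 samples and the Q class is raised to 50, so the classes are adjusted without being made exactly equal.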

Daniel Canas answered Oct 21 '22 23:10
