How to select training data for naive bayes classifier

Question

I want to double-check some concepts I am uncertain about regarding the training set for classifier learning. When we select records for the training data, do we select an equal number of records per class, summing to N, or do we randomly pick N records regardless of class?

Intuitively I was thinking of the former, but then the prior class probabilities would all be equal and not really be helpful?

Daniel Canas · Accepted Answer

It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand. You can ask the following questions:

Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
Is there a large difference in the prior probabilities of each class?

If so, you should probably redistribute the classes.
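To answer the second question, you can estimate the priors directly from the label counts, which is also how a naive Bayes classifier typically estimates them. The following is a minimal sketch using only the Python standard library; the function names and the 990/10 label split are hypothetical and chosen for illustration.

```python
from collections import Counter


def class_priors(labels):
    """Estimate the prior P(c) = N_c / N for each class from its label count."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical labels: 990 legitimate records and 10 fraud records.
labels = ["legit"] * 990 + ["fraud"] * 10

priors = class_priors(labels)
ratio = imbalance_ratio(labels)

print(priors)  # prior probability per class
print(ratio)   # how many times larger the biggest class is than the smallest
```

A large ratio is a hint that the rare class may need more representation in the training set. Note that if you then sample an equal number of records per class, the same estimate gives equal priors for every class.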

In my experience, there is no harm in redistributing the classes, but it's not always necessary.

It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the class you want to predict can make up less than 1% of the records. In this case you must distribute the classes evenly in the training set if you want the classifier to learn the differences between the classes. Otherwise, training can produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, and identifying fraud is the whole point of creating the classifier to begin with.
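The 99% figure is easy to reproduce. The sketch below assumes a hypothetical data set of 1,000 records with 10 fraud cases and a degenerate classifier that always predicts the majority class. The helper functions are written here for illustration and are not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the truly positive records that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical data: 10 fraud cases among 1,000 records (1%).
y_true = ["fraud"] * 10 + ["legit"] * 990

# Degenerate classifier: always predicts the majority class.
y_pred = ["legit"] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="fraud")

print(acc)  # high accuracy despite the classifier being useless for fraud
print(rec)  # no fraud case is ever identified
```

Accuracy alone hides the problem here, which is why the recall on the rare class is worth checking as well.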

Once you have a set of evenly distributed classes, you can use any technique, such as k-fold cross-validation, to perform the actual training and evaluation.
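One way to get an evenly distributed set is random undersampling of the larger classes, after which the records can be split into k folds. The sketch below assumes that approach, which is only one of several ways to rebalance. The function names, the seed, and the 1,000-record data set are hypothetical, and the fold splitter is a plain round-robin split rather than a stratified one.

```python
import random


def undersample(records, labels, seed=0):
    """Randomly keep, for every class, as many records as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    n_keep = min(len(recs) for recs in by_class.values())
    balanced = []
    for lab, recs in by_class.items():
        for rec in rng.sample(recs, n_keep):
            balanced.append((rec, lab))
    rng.shuffle(balanced)  # mix the classes before splitting into folds
    return balanced


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Hypothetical data: record ids 0..999, the first 10 of which are fraud.
records = list(range(1000))
labels = ["fraud" if i < 10 else "legit" for i in records]

balanced = undersample(records, labels)
folds = list(k_fold_indices(len(balanced), k=5))

for train_idx, test_idx in folds:
    # A classifier would be fitted on the train_idx rows and scored on test_idx.
    print(len(train_idx), len(test_idx))
```

Undersampling throws away most of the majority class, so with very rare classes it may be preferable to duplicate or collect more records of the rare class instead.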

Another example where class distributions need to be adjusted, though not necessarily to an equal number of records per class, is recognizing upper-case letters of the alphabet from their shapes.

If you train the classifier on letters in the proportions in which they commonly occur in English, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for a comparable number of Qs and Os, the classifier doesn't have enough information to ever recognize a Q. You need to feed it enough information (i.e. more Qs) so it can determine that Q and O are indeed different letters.
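When collecting more real Qs is not possible, a simple stand-in is to duplicate the Q samples you already have (random oversampling). The sketch below assumes that technique. The sample names, the counts of 75 Os and 1 Q, and the target of 75 per class are made up for illustration and are not real letter frequencies.

```python
import random
from collections import Counter


def oversample_to(samples, target_per_class, seed=0):
    """Duplicate randomly chosen records of under-represented classes
    until every class has at least target_per_class records."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in samples:
        by_class.setdefault(label, []).append((features, label))
    resampled = []
    for label, items in by_class.items():
        resampled.extend(items)
        shortfall = target_per_class - len(items)
        if shortfall > 0:
            # Draw with replacement from the existing records of this class.
            resampled.extend(rng.choices(items, k=shortfall))
    return resampled


# Hypothetical training samples: many Os, a single Q.
samples = [(f"o_shape_{i}", "O") for i in range(75)] + [("q_shape_0", "Q")]

resampled = oversample_to(samples, target_per_class=75)
counts = Counter(label for _, label in resampled)

print(counts)  # number of training records per letter after oversampling
```

Duplicated records add no new shape information, so genuinely different examples of the rare letter are better when they are available. The target does not have to equal the largest class; any count that gives the classifier enough Qs to learn from may do.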

Tags:

machine-learning

classification

goh
