I want to double-check some concepts I am uncertain about regarding the training set for classifier learning. When we select records for our training data, do we select an equal number of records per class, summing to N, or do we randomly pick N records regardless of class?
Intuitively I would go with the former, but then the prior class probabilities would all be equal and so not really helpful?
It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand. You can ask the following questions:
If so, you should probably redistribute the classes.
In my experience, there is no harm in redistributing the classes, but it's not always necessary.
It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the class you want to predict can make up less than 1% of the records. In this case you must distribute the classes evenly in the training set if you want the classifier to learn the differences between the classes. Otherwise, training will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, and identifying fraud is the whole point of creating the classifier to begin with.
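To make the "over 99% correct but useless" point concrete, here is a minimal sketch in plain Python. The data is made up for illustration (1,000 transactions, 5 of them fraudulent), and `accuracy_and_recall` is a hypothetical helper, not part of any library.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical data: 1 = fraud, 0 = legitimate (0.5% fraud).
y_true = [1] * 5 + [0] * 995
# A degenerate classifier that always predicts "legitimate".
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
# prints: accuracy=0.995 recall=0.000
```

The accuracy looks excellent, yet not a single fraud case is found, which is why accuracy alone says little on heavily skewed classes.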
Once you have a set of evenly distributed classes you can use any technique, such as k-fold cross-validation, to train and validate the classifier.
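One simple way to get evenly distributed classes is to randomly undersample every class down to the size of the smallest one. The sketch below assumes records are `(id, label)` tuples; the function name `undersample`, the `label_of` parameter, and the sample data are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(records, label_of, seed=0):
    """Randomly keep the same number of records from every class.

    The per-class count is the size of the smallest class.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    n = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))  # sample without replacement
    rng.shuffle(balanced)  # avoid all of one class coming first
    return balanced


# Hypothetical data: 10 fraud records among 1,000.
records = [(i, "fraud") for i in range(10)] + [(i, "ok") for i in range(10, 1000)]
balanced = undersample(records, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts["fraud"], counts["ok"])
# prints: 10 10
```

Undersampling throws away most of the majority class, so it fits best when data is plentiful; the resulting `balanced` list can then be split into folds as usual.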
Another example where class distributions need to be adjusted, though not necessarily to an equal number of records for each class, is recognizing upper-case letters of the alphabet from their shapes.
If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Q's) so it can determine that Q and O are indeed different letters.
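When classes do not need to be exactly equal, one option is to guarantee a minimum number of examples per class by resampling the rare ones with replacement. This is only a sketch: the function `oversample_to_minimum`, the threshold of 50, and the placeholder sample names are assumptions made for illustration. Duplicating the few Q's you have adds weight rather than new shapes, so collecting more real Q samples is preferable where possible.

```python
import random
from collections import Counter


def oversample_to_minimum(samples, min_per_class, seed=0):
    """Pad each class up to min_per_class by resampling with replacement.

    Classes that already have enough samples are left unchanged.
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in samples:
        by_class.setdefault(label, []).append((features, label))

    result = []
    for group in by_class.values():
        result.extend(group)
        shortfall = min_per_class - len(group)
        if shortfall > 0:
            # Draw duplicates of existing samples to fill the gap.
            result.extend(rng.choices(group, k=shortfall))
    return result


# Hypothetical letter data: many O's, a single Q.
samples = [(f"O_shape_{i}", "O") for i in range(75)] + [("Q_shape_0", "Q")]
padded = oversample_to_minimum(samples, min_per_class=50)
letter_counts = Counter(label for _, label in padded)
print(letter_counts["O"], letter_counts["Q"])
# prints: 75 50
```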