Correct ratio of positive to negative training examples for training a random forest-based binary classifier

I realized that the related question Positives/negatives proportion in train set suggested that a 1-to-1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.

However, this question differs from the related question in that it concerns a random forest model and also in the following two ways.

1) I have plenty of training data to work with, and the main bottleneck on using more training examples is training iteration time. That is, I'd prefer not to take more than a night to train one ranker because I want to iterate quickly.

2) In practice, the classifier will probably see 1 positive example for every 4 negative examples.

In this situation, should I train using more negative examples than positive examples, or still equal numbers of positive and negative examples?

asked Jul 28 '13 by merlin2011

2 Answers

See the section titled "Balancing prediction error" from the official documentation on random forests here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance


In summary, this seems to suggest that you should either

  1. make your training and test data reflect the 1:4 class ratio that your real-life data will have, or
  2. train on a 1:1 mix, but then carefully adjust the weights per class, as demonstrated below, until the OOB error rate on your desired (smaller) class is acceptably low.

Hope that helps.
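The second option above can be sketched in scikit-learn, which exposes the same idea through the `class_weight` parameter (an assumption on my part: the quoted documentation refers to Breiman's original implementation, but the mechanism carries over; the synthetic data here is only a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with a skewed class ratio; in practice,
# substitute your own training set.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1.0, 1: 4.0},  # up-weight the rarer positive class
    oob_score=True,                 # track out-of-bag accuracy while tuning
    random_state=0,
)
clf.fit(X, y)
print(clf.oob_score_)  # use this to judge whether the weights need adjusting
```

The `oob_score_` attribute gives you the out-of-bag estimate to watch while you adjust the weights, without needing a separate validation set.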

In some data sets, the prediction error between classes is highly unbalanced. Some classes have a low prediction error, others a high. This occurs usually when one class is much larger than another. Then random forests, trying to minimize overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common to have the actives outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (actives) will be very high.

The user can detect the imbalance by outputting the error rates for the individual classes. To illustrate, 20-dimensional synthetic data is used: class 1 is drawn from one spherical Gaussian, class 2 from another. A training set of 1000 class-1 and 50 class-2 examples is generated, together with a test set of 5000 class-1 and 250 class-2 examples.

The final output of a forest of 500 trees on this data is (columns: number of trees, overall test error %, class 1 error %, class 2 error %):

500 3.7 0.0 78.4

There is a low overall test set error (3.73%) but class 2 has over 3/4 of its cases misclassified.

The error balancing can be done by setting different weights for the classes.

The higher the weight a class is given, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set weights to 1 on class 1, and 20 on class 2, and run again. The output is:

500 12.1 12.7 0.0

The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:

500 4.3 4.2 5.2

This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.

Note that in getting this balance, the overall error rate went up. This is the usual result - to get better balance, the overall error rate will be increased.
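The quoted experiment can be roughly reproduced with scikit-learn (an assumption: Breiman's numbers came from his own implementation and the exact Gaussian separation isn't stated, so the `shift` value below is a guess and the error rates will differ, but the unweighted-vs-weighted pattern should show up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
d = 20
shift = 0.25  # per-dimension separation between the two Gaussians (a guess)

# 1000 class-1 vs 50 class-2 training examples, 5000 vs 250 for testing.
X_tr = np.vstack([rng.normal(0.0, 1.0, (1000, d)),
                  rng.normal(shift, 1.0, (50, d))])
y_tr = np.array([1] * 1000 + [2] * 50)
X_te = np.vstack([rng.normal(0.0, 1.0, (5000, d)),
                  rng.normal(shift, 1.0, (250, d))])
y_te = np.array([1] * 5000 + [2] * 250)

for weights in (None, {1: 1, 2: 10}):
    clf = RandomForestClassifier(n_estimators=500, class_weight=weights,
                                 random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    errs = {c: float(np.mean(pred[y_te == c] != c)) for c in (1, 2)}
    print(f"class_weight={weights}: per-class test error {errs}")
```

Running it with `class_weight=None` versus `{1: 1, 2: 10}` lets you watch the minority-class error drop at the cost of a higher majority-class error, which is the trade-off the documentation describes.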

answered Sep 20 '22 by vijucat


This might seem like a trivial answer, but the best thing I can suggest is to try on a small subset of your data (small enough that the algorithm trains quickly) and observe what your accuracy is when you use ratios of 1:1, 1:2, 1:3, and so on.

Plot the results as you gradually increase the total number of examples at each ratio and see how the performance responds. Very often you'll find that a fraction of the data gets very close to the performance of training on the full dataset, in which case you can make an informed decision about your question.
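The experiment described above might look like this (a sketch assuming scikit-learn; `subsample` is a hypothetical helper I'm introducing here, and the synthetic data merely stands in for the asker's real training set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def subsample(X, y, n_pos, ratio, rng):
    """Draw n_pos positive examples and n_pos * ratio negative examples."""
    pos = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size=n_pos * ratio, replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

# Synthetic stand-in for "plenty of training data".
X, y = make_classification(n_samples=20000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

rng = np.random.default_rng(0)
results = {}
for ratio in (1, 2, 3, 4):          # negatives per positive
    for n_pos in (100, 200, 400):   # growing subset sizes
        X_sub, y_sub = subsample(X_train, y_train, n_pos, ratio, rng)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_sub, y_sub)
        results[(ratio, n_pos)] = accuracy_score(y_test, clf.predict(X_test))

for key in sorted(results):
    print(key, round(results[key], 3))
```

Plotting `results` per ratio as `n_pos` grows shows where the curves flatten, which is the point where adding more data (or a different ratio) stops paying for the extra training time.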

Hope that helps.

answered Sep 22 '22 by Mike