I realized that the related question Positives/negatives proportion in train set suggested that a 1-to-1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.
However, my question differs from that one in that it concerns a random forest model, and also in the following two ways.
1) I have plenty of training data to work with, and the main bottleneck on using more training examples is training iteration time. That is, I'd prefer not to take more than a night to train one ranker because I want to iterate quickly.
2) In practice, the classifier will probably see 1 positive example for every 4 negative examples.
In this situation, should I train using more negative examples than positive examples, or still equal numbers of positive and negative examples?
Without examples from both classes, the model has no way of telling how the features differ between classes (e.g., what properties of an essay make it more or less likely to be written by a male student).
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class.
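These four outcomes are exactly what a confusion matrix tabulates. As a minimal sketch using scikit-learn's confusion_matrix (the labels and predictions are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```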
We can also use the random forest model as a final model and make predictions for classification. First, the random forest ensemble is fit on all available data, then the predict() function can be called to make predictions on new data. The example below demonstrates this on our binary classification dataset.
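As a minimal sketch of that fit-then-predict workflow, assuming scikit-learn's RandomForestClassifier and a synthetic dataset from make_classification (the new row of data is made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification dataset standing in for "all available data"
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Fit the final random forest model on everything we have
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, y)

# Make a prediction for a single new row of data (values are illustrative)
new_row = [[-0.5] * 20]
print("Predicted class:", model.predict(new_row)[0])
```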
Random forests are trained via the bagging method. Bagging, or bootstrap aggregating, consists of randomly sampling subsets of the training data (with replacement), fitting a model to each of these smaller datasets, and aggregating their predictions.
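As a rough sketch of the bagging idea itself (bootstrap samples drawn with NumPy, one decision tree per sample, predictions aggregated by majority vote; this is illustrative rather than how any particular library implements it):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

# Fit each tree on a bootstrap sample (rows drawn at random with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: majority vote over the individual trees' predictions
votes = np.stack([tree.predict(X) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```

A random forest adds one more ingredient on top of bagging: each split in each tree considers only a random subset of the features.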
See the section titled "Balancing prediction error" from the official documentation on random forests here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
The relevant passage is quoted below.
In summary, this seems to suggest that your training and test data should reflect the class ratio you expect in practice, and that any resulting imbalance in per-class error can be corrected by weighting the classes, e.g. inversely proportionally to the class populations.
Hope that helps.
In some data sets, the prediction error between classes is highly unbalanced. Some classes have a low prediction error, others a high. This occurs usually when one class is much larger than another. Then random forests, trying to minimize overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common to have the actives outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (actives) will be very high.
The user can detect the imbalance from the outputs of the error rates for the individual classes. To illustrate, 20-dimensional synthetic data is used: class 1 comes from one spherical Gaussian, class 2 from another. A training set of 1000 class 1's and 50 class 2's is generated, together with a test set of 5000 class 1's and 250 class 2's.
The final output of a forest of 500 trees on this data is (number of trees, overall test error %, class 1 error %, class 2 error %):
500   3.7   0.0   78.4
There is a low overall test set error (3.73%) but class 2 has over 3/4 of its cases misclassified.
The error balancing can be done by setting different weights for the classes.
The higher the weight a class is given, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set weights to 1 on class 1, and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.
Note that in getting this balance, the overall error rate went up. This is the usual result - to get better balance, the overall error rate will be increased.
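Breiman's numbers above come from his own implementation, but the same idea carries over to other libraries. Below is a sketch of the weighting experiment using scikit-learn's class_weight parameter on two spherical Gaussians like the ones described; the exact error rates will differ from the quoted output:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_gaussians(n1, n2, dim=20, shift=1.0):
    """Class 1 from one spherical Gaussian, class 2 from another (means shifted by `shift`)."""
    X = np.vstack([rng.normal(0.0, 1.0, size=(n1, dim)),
                   rng.normal(shift, 1.0, size=(n2, dim))])
    y = np.array([1] * n1 + [2] * n2)
    return X, y

X_train, y_train = make_gaussians(1000, 50)   # imbalanced training set, as in the quote
X_test, y_test = make_gaussians(5000, 250)    # imbalanced test set

def report_per_class_error(weights):
    model = RandomForestClassifier(n_estimators=500, class_weight=weights, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    for cls in (1, 2):
        mask = y_test == cls
        print(f"weights={weights}: class {cls} error = {(pred[mask] != cls).mean():.3f}")

report_per_class_error(None)           # unweighted: the minority class error tends to be much higher
report_per_class_error({1: 1, 2: 10})  # upweight the minority class, as suggested above
```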
This might seem like a trivial answer, but the best thing I can suggest is to try it on a small subset of your data (small enough that the algorithm trains quickly) and observe what your accuracy is when you use ratios of 1:1, 1:2, 1:3, and so on.
Plot the results as you gradually increase the total number of examples for each ratio and see how the performance responds. Very often you'll find that a fraction of the data gets very close to the performance of training on the full dataset, in which case you can make an informed decision about your question; a rough sketch of such an experiment follows.
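Here is that sketch, assuming the data has already been split into pools of positive and negative feature rows (the pool contents and sizes here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical pools of positive and negative examples (rows of features)
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(5000, 20))
neg = rng.normal(0.0, 1.0, size=(20000, 20))

for n_pos in (250, 500, 1000, 2000):
    for neg_per_pos in (1, 2, 3, 4):
        n_neg = n_pos * neg_per_pos
        X = np.vstack([pos[:n_pos], neg[:n_neg]])
        y = np.array([1] * n_pos + [0] * n_neg)
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        score = cross_val_score(model, X, y, cv=3).mean()
        print(f"{n_pos} positives at ratio 1:{neg_per_pos} -> CV accuracy {score:.3f}")
```

One caveat: with unequal ratios, plain accuracy can be misleading, so it is worth also tracking per-class error (as in the Breiman passage quoted above) or another metric suited to imbalance.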
Hope that helps.