I realized that the related question Positives/negatives proportion in train set suggested that a 1-to-1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.
However, my question differs from that one in that it concerns a random forest model, and also in the following two ways.
1) I have plenty of training data to work with, and the main bottleneck on using more training examples is training iteration time. That is, I'd prefer not to take more than a night to train one ranker because I want to iterate quickly.
2) In practice, the classifier will probably see 1 positive example for every 4 negative examples.
In this situation, should I train using more negative examples than positive examples, or still equal numbers of positive and negative examples?
Without examples from both classes, the model has no way of telling how the features differ between classes (e.g., what properties of an essay make it more or less likely to be written by a male student).
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class.
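These four outcomes are exactly what a confusion matrix tabulates. As a minimal sketch using scikit-learn's confusion_matrix (the labels and predictions are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```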
We can also use the random forest model as a final model and make predictions for classification. First, the random forest ensemble is fit on all available data, then the predict() function can be called to make predictions on new data. The example below demonstrates this on our binary classification dataset.
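As a minimal sketch of that fit-then-predict workflow, assuming scikit-learn's RandomForestClassifier and a synthetic dataset from make_classification (the new row of data is made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification dataset standing in for "all available data"
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Fit the final random forest model on everything we have
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, y)

# Make a prediction for a single new row of data (values are illustrative)
new_row = [[-0.5] * 20]
print("Predicted class:", model.predict(new_row)[0])
```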
Random forests are trained via the bagging method. Bagging, or bootstrap aggregating, consists of randomly sampling subsets of the training data (with replacement), fitting a model to each of these smaller datasets, and aggregating their predictions.
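As a rough sketch of the bagging idea itself (bootstrap samples drawn with NumPy, one decision tree per sample, predictions aggregated by majority vote; this is illustrative rather than how any particular library implements it):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

# Fit each tree on a bootstrap sample (rows drawn at random with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: majority vote over the individual trees' predictions
votes = np.stack([tree.predict(X) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```

A random forest adds one more ingredient on top of bagging: each split in each tree considers only a random subset of the features.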
See the section titled "Balancing prediction error" from the official documentation on random forests here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
The relevant passage is quoted below.
In summary, this seems to suggest that your training and test data should reflect the class ratio you expect in practice, and that any resulting imbalance in per-class error can be corrected by weighting the classes, e.g. inversely proportionally to the class populations.
Hope that helps.
In some data sets, the prediction error between classes is highly unbalanced. Some classes have a low prediction error, others a high. This occurs usually when one class is much larger than another. Then random forests, trying to minimize overall error rate, will keep the error rate low on the large class while letting the smaller classes have a larger error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common to have the actives outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (actives) will be very high.
The user can detect the imbalance from the outputs of the error rates for the individual classes. To illustrate, 20-dimensional synthetic data is used: class 1 comes from one spherical Gaussian, class 2 from another. A training set of 1000 class 1's and 50 class 2's is generated, together with a test set of 5000 class 1's and 250 class 2's.
The final output of a forest of 500 trees on this data is (number of trees, overall test error %, class 1 error %, class 2 error %):
500   3.7   0.0   78.4
There is a low overall test set error (3.73%) but class 2 has over 3/4 of its cases misclassified.
The error balancing can be done by setting different weights for the classes.
The higher the weight a class is given, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set weights to 1 on class 1, and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.
Note that in getting this balance, the overall error rate went up. This is the usual result - to get better balance, the overall error rate will be increased.
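Breiman's numbers above come from his own implementation, but the same idea carries over to other libraries. Below is a sketch of the weighting experiment using scikit-learn's class_weight parameter on two spherical Gaussians like the ones described; the exact error rates will differ from the quoted output:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_gaussians(n1, n2, dim=20, shift=1.0):
    """Class 1 from one spherical Gaussian, class 2 from another (means shifted by `shift`)."""
    X = np.vstack([rng.normal(0.0, 1.0, size=(n1, dim)),
                   rng.normal(shift, 1.0, size=(n2, dim))])
    y = np.array([1] * n1 + [2] * n2)
    return X, y

X_train, y_train = make_gaussians(1000, 50)   # imbalanced training set, as in the quote
X_test, y_test = make_gaussians(5000, 250)    # imbalanced test set

def report_per_class_error(weights):
    model = RandomForestClassifier(n_estimators=500, class_weight=weights, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    for cls in (1, 2):
        mask = y_test == cls
        print(f"weights={weights}: class {cls} error = {(pred[mask] != cls).mean():.3f}")

report_per_class_error(None)           # unweighted: the minority class error tends to be much higher
report_per_class_error({1: 1, 2: 10})  # upweight the minority class, as suggested above
```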
This might seem like a trivial answer, but the best thing I can suggest is to try it on a small subset of your data (small enough that the algorithm trains quickly) and observe what your accuracy is when you use ratios of 1:1, 1:2, 1:3, and so on.
Plot the results as you gradually increase the total number of examples for each ratio and see how the performance responds. Very often you'll find that a fraction of the data gets very close to the performance of training on the full dataset, in which case you can make an informed decision about your question; a rough sketch of such an experiment follows.
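Here is that sketch, assuming the data has already been split into pools of positive and negative feature rows (the pool contents and sizes here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical pools of positive and negative examples (rows of features)
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(5000, 20))
neg = rng.normal(0.0, 1.0, size=(20000, 20))

for n_pos in (250, 500, 1000, 2000):
    for neg_per_pos in (1, 2, 3, 4):
        n_neg = n_pos * neg_per_pos
        X = np.vstack([pos[:n_pos], neg[:n_neg]])
        y = np.array([1] * n_pos + [0] * n_neg)
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        score = cross_val_score(model, X, y, cv=3).mean()
        print(f"{n_pos} positives at ratio 1:{neg_per_pos} -> CV accuracy {score:.3f}")
```

One caveat: with unequal ratios, plain accuracy can be misleading, so it is worth also tracking per-class error (as in the Breiman passage quoted above) or another metric suited to imbalance.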
Hope that helps.