Perhaps this is too long-winded. Simple question about sklearn's random forest:
For a true/false classification problem, is there a way in sklearn's random forest to specify the sample size used to train each tree, along with the ratio of true to false observations?
More details are below:
In the R implementation of random forest, the randomForest package, there's a `sampsize` argument. It lets you balance the sample used to train each tree based on the outcome.
For example, if you're trying to predict whether an outcome is true or false and 90% of the outcomes in the training set are false, you can set `sampsize = c(500, 500)`. Each tree is then trained on a random sample (drawn with replacement) from the training set containing 500 true and 500 false observations. In these situations, I've found the models predict true outcomes much better when using a 50% cut-off, yielding much higher kappas.
It doesn't seem like there is an option for this in the sklearn implementation.
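Since sklearn's `RandomForestClassifier` has no direct equivalent of `sampsize`, one workaround is to build the ensemble by hand: draw a fixed number of observations per class (with replacement) for each tree, exactly as `sampsize = c(500, 500)` would. Below is a minimal sketch of that idea; the data, the tree count, and the per-class sample size are all made up for illustration.

```python
# Sketch: emulate R's sampsize = c(n, n) by drawing a class-balanced
# bootstrap sample for each tree. Illustration only, not sklearn's
# built-in behavior; all sizes here are arbitrary.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy imbalanced data: roughly 90% False, 10% True.
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=5000)) > 2.5

n_trees, n_per_class = 25, 200          # hypothetical settings
pos_idx, neg_idx = np.flatnonzero(y), np.flatnonzero(~y)

trees = []
for _ in range(n_trees):
    # Fixed count per class, sampled with replacement.
    idx = np.concatenate([
        rng.choice(pos_idx, n_per_class, replace=True),
        rng.choice(neg_idx, n_per_class, replace=True),
    ])
    # max_features="sqrt" mimics random forest's per-split feature sampling.
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

# Majority vote across trees at a 50% cut-off.
votes = np.mean([t.predict(X) for t in trees], axis=0)
pred = votes >= 0.5
```

The `imbalanced-learn` package also ships a `BalancedRandomForestClassifier` that does this kind of per-tree resampling for you, if adding a dependency is acceptable.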
In sklearn, the sub-sample size is controlled with the `max_samples` parameter if `bootstrap=True` (the default); otherwise the whole dataset is used to build each tree. Note that `max_samples` only controls how many rows each tree sees, not the class ratio within them.
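For example, here is `max_samples` in use (available in sklearn 0.22 and later); the dataset and the settings are arbitrary, chosen just to show the parameter:

```python
# max_samples fixes the size of each tree's bootstrap sample,
# but the class mix inside that sample is still left to chance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data: ~90% / ~10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,
    max_samples=500,   # each tree is fit on 500 rows drawn with replacement
    random_state=0,
).fit(X, y)
```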
In version 0.16-dev, you can now use `class_weight="auto"` to get something close to what you want. This still uses all samples, but it reweights them so that the classes become balanced.
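In later sklearn releases the `"auto"` value was renamed `"balanced"`, and there is also `"balanced_subsample"`, which recomputes the weights on each tree's bootstrap sample and so is closer in spirit to R's per-tree balancing. A minimal sketch (toy data, arbitrary settings):

```python
# class_weight="balanced_subsample" reweights classes per bootstrap
# sample rather than once globally; the data here is made up.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced_subsample",
    random_state=0,
).fit(X, y)
```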