If I understand correctly, when Random Forest estimators are built, bootstrapping is usually applied, which means that tree(i) is built using only data from sample(i), drawn with replacement. I want to know the size of the sample that sklearn's RandomForestRegressor uses.
The only thing I see that comes close is:
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
But there is no way to specify the size or proportion of the sample, nor does the documentation state the default sample size.
I feel like there should at least be a way to know what the default sample size is. What am I missing?
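For reference, this is what I mean by bootstrapping: a minimal numpy sketch (my own illustration, not sklearn's internal code) of drawing a sample the same size as the training set, with replacement:

import numpy as np

# Toy data standing in for a training set: 100 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
n_samples = X.shape[0]

# A bootstrap sample of the same size as the training set:
# rows are drawn with replacement, so some appear more than once
# and others not at all.
idx = rng.choice(n_samples, size=n_samples, replace=True)
X_bootstrap = X[idx]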
For random forests to work as well on new data as they do on training data, the required sample size is enormous, often 200 times the number of candidate features.
For testing, 10 is enough, but to achieve robust results you can increase it to 100 or 500. This, however, only makes sense if you have more than 8 input rasters; otherwise the training data is always the same, even if you repeat it 1000 times.
Conclusion: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major concern. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.
Uhh, I agree with you: it's quite strange that we cannot specify the subsample/bootstrap size in the RandomForestRegressor algorithm. A potential workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor is just a special case of BaggingRegressor (both use bootstraps to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor, the base estimator is forced to be a DecisionTree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size: for example, max_samples=0.5 will draw random subsamples whose size is half of the entire training set. You can also use just a subset of the features by setting max_features and bootstrap_features.
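For illustration, a minimal sketch of that workaround (the dataset and parameter values are arbitrary placeholders; note that newer scikit-learn versions rename base_estimator to estimator):

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# Bagged decision trees with a custom bootstrap size:
# each tree is fit on a bootstrap sample of half the training set
# and sees a random 80% subset of the features.
bag = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(),  # renamed to "estimator" in scikit-learn 1.2+
    n_estimators=100,
    max_samples=0.5,
    max_features=0.8,
    bootstrap=True,
    bootstrap_features=False,
    random_state=0,
)
bag.fit(X, y)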
In scikit-learn version 0.22, the max_samples option has been added, which does what you asked; see the documentation of the RandomForestRegressor class.
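For example, a short sketch assuming scikit-learn >= 0.22 (data and values are placeholders):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# max_samples sets the bootstrap sample size per tree:
# a float in (0, 1] is a fraction of the training set, an int is an
# absolute row count, and None (the default) uses all n_samples rows.
rf = RandomForestRegressor(n_estimators=100, max_samples=0.5, random_state=0)
rf.fit(X, y)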
The bootstrap sample size is always equal to the number of samples.
You are not missing anything; the same question was asked on the mailing list for RandomForestClassifier:
The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.