 

Size of sample in Random Forest Regression

If I understand correctly, when Random Forest estimators are fit, bootstrapping is usually applied, which means that tree(i) is built using only data from sample(i), drawn with replacement. I want to know the size of the sample that sklearn's RandomForestRegressor uses.

The only thing I see that comes close is:

bootstrap : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.

But there is no way to specify the size or proportion of the sample, nor does it tell me what the default sample size is.

I feel like there should at least be a way to know what the default sample size is. What am I missing?

asked Jul 08 '15 by Akavall



3 Answers

I agree with you that it's quite strange that we cannot specify the subsample/bootstrap size in the RandomForestRegressor algorithm. A potential workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor is essentially a special case of BaggingRegressor (both use bootstrap samples to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor the base estimator is forced to be a DecisionTree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size: for example, max_samples=0.5 will draw random subsamples half the size of the entire training set. You can also restrict each estimator to a subset of the features by setting max_features and bootstrap_features.
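
A minimal sketch of that workaround (the dataset and parameter values are illustrative only, not from the original answer; note that recent scikit-learn versions rename base_estimator to estimator):

    # Approximating a random forest with BaggingRegressor so the bootstrap
    # sample size can be controlled explicitly.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import BaggingRegressor
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    # max_samples=0.5 draws bootstrap samples half the size of the training set;
    # max_features subsamples the features available to each estimator.
    bag = BaggingRegressor(
        base_estimator=DecisionTreeRegressor(),  # named `estimator` in scikit-learn >= 1.2
        n_estimators=100,
        max_samples=0.5,
        max_features=0.7,
        bootstrap=True,
        random_state=0,
    )
    bag.fit(X, y)
    print(bag.score(X, y))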

answered Oct 16 '22 by Jianxun Li


As of scikit-learn version 0.22, a max_samples option has been added to RandomForestRegressor, which does exactly what you asked; see the documentation of the class.
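
A short sketch of how that option is used (the data and values are illustrative, assuming scikit-learn >= 0.22):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    # max_samples accepts a float (fraction of the training set) or an int (row count).
    rf = RandomForestRegressor(n_estimators=100, max_samples=0.5,
                               bootstrap=True, random_state=0)
    rf.fit(X, y)
    print(rf.score(X, y))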

answered Oct 16 '22 by Ezriel_S


The bootstrap sample size is always equal to the number of samples in the training set.

You are not missing anything; the same question was asked on the mailing list for RandomForestClassifier:

The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.
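
A quick illustration of what that means in practice (not from the original answer): each tree's bootstrap sample has the same length as the training set, so roughly 63% of the rows appear in any given tree's sample.

    # Drawing n indices with replacement from n rows, as the bootstrap does,
    # leaves about 63% of the rows represented in a single tree's sample.
    import numpy as np

    n = 10_000
    rng = np.random.default_rng(0)
    bootstrap_indices = rng.integers(0, n, size=n)  # sample size == training set size
    unique_fraction = np.unique(bootstrap_indices).size / n
    print(f"{unique_fraction:.2%} of rows appear in this bootstrap sample")  # ~63%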

answered Oct 16 '22 by ldirer