If I understand correctly, when Random Forest estimators are built, bootstrapping is usually applied, which means that tree(i) is built using only data from sample(i), drawn with replacement. I want to know the size of the sample that sklearn's RandomForestRegressor uses.
The only thing I see that comes close is:
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
But there is no way to specify the size or proportion of the sample, nor does the documentation state the default sample size.
I feel like there should at least be a way to know what the default sample size is. What am I missing?
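For reference, this is what I mean by bootstrapping: a minimal numpy sketch (my own illustration, not sklearn's internal code) of drawing a sample the same size as the training set, with replacement:

import numpy as np

# Toy data standing in for a training set: 100 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
n_samples = X.shape[0]

# A bootstrap sample of the same size as the training set:
# rows are drawn with replacement, so some appear more than once
# and others not at all.
idx = rng.choice(n_samples, size=n_samples, replace=True)
X_bootstrap = X[idx]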
For random forests to work as well on new data as they do on training data, the required sample size is enormous, often 200 times the number of candidate features.
For testing, 10 is enough, but to achieve robust results you can increase it to 100 or 500. This, however, only makes sense if you have more than 8 input rasters; otherwise the training data is always the same, even if you repeat it 1000 times.
Conclusion: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
Random Forest is suitable for situations when we have a large dataset, and interpretability is not a major concern. Decision trees are much easier to interpret and understand. Since a random forest combines multiple decision trees, it becomes more difficult to interpret.
Uhh, I agree with you: it's quite strange that we cannot specify the subsample/bootstrap size in the RandomForestRegressor algorithm. A potential workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor is just a special case of BaggingRegressor (both use bootstraps to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor, the base estimator is forced to be a DecisionTree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size: for example, max_samples=0.5 will draw random subsamples whose size is half of the entire training set. You can also use just a subset of the features by setting max_features and bootstrap_features.
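For illustration, a minimal sketch of that workaround (the dataset and parameter values are arbitrary placeholders; note that newer scikit-learn versions rename base_estimator to estimator):

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# Bagged decision trees with a custom bootstrap size:
# each tree is fit on a bootstrap sample of half the training set
# and sees a random 80% subset of the features.
bag = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(),  # renamed to "estimator" in scikit-learn 1.2+
    n_estimators=100,
    max_samples=0.5,
    max_features=0.8,
    bootstrap=True,
    bootstrap_features=False,
    random_state=0,
)
bag.fit(X, y)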
In scikit-learn version 0.22, the max_samples option has been added, which does what you asked; see the documentation of the RandomForestRegressor class.
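For example, a short sketch assuming scikit-learn >= 0.22 (data and values are placeholders):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# max_samples sets the bootstrap sample size per tree:
# a float in (0, 1] is a fraction of the training set, an int is an
# absolute row count, and None (the default) uses all n_samples rows.
rf = RandomForestRegressor(n_estimators=100, max_samples=0.5, random_state=0)
rf.fit(X, y)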
The bootstrap sample size is always equal to the number of samples.
You are not missing anything; the same question was asked on the mailing list for RandomForestClassifier:
The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.