What does RandomForestClassifier() do if we choose bootstrap = False?
According to the definition at this link:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
bootstrap : boolean, optional (default=True) Whether bootstrap samples are used when building trees.
I'm asking because I want to apply a Random Forest to a time series: train on a rolling window of size (t-n) and predict date (t+k). I want to know whether this is what would happen if we choose True or False:
1) If Bootstrap = True: the training samples can come from any day, each with any number of (randomly chosen) features. For example, a tree could be trained on samples from day (t-15), day (t-19), and day (t-35), each with randomly chosen features, and then predict the output of date (t+1).
2) If Bootstrap = False: it is going to use all the samples and all the features from date (t-n) to t for training, so it actually respects the date order (meaning it uses t-35, t-34, t-33, and so on up to t-1), and then predicts the output of date (t+1).
If this is how Bootstrap works, I would be inclined to use Bootstrap = False, because otherwise it would be a bit strange (think of financial series) to ignore consecutive days' returns and jump from day t-39 to t-19 and then to day t-15 to predict day t+1. We would be missing all the information between those days.
So... is this how Bootstrap works?
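To make the setup concrete, here is a rough sketch of what I mean. The lag features, window size, and randomly generated "returns" are placeholders, not my actual data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical daily return series; the real data and feature set are not shown here.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(size=500), name="ret")

# Lagged features: the row for date t holds the previous n_lags days of returns.
n_lags = 35
X = pd.concat({f"lag_{i}": returns.shift(i) for i in range(1, n_lags + 1)}, axis=1)

next_ret = returns.shift(-1)                       # return at t+1 (the prediction target)
data = pd.concat([X, next_ret.rename("next_ret")], axis=1).dropna()
y = (data["next_ret"] > 0).astype(int)             # classify: will the next day's return be positive?
features = data.drop(columns="next_ret")

# Rolling window: train on the last `window` days, predict the most recent day.
window = 250
clf = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=0)
clf.fit(features.iloc[-window - 1:-1], y.iloc[-window - 1:-1])
print(clf.predict(features.iloc[[-1]]))
```

Note that each row here is one date carrying all of its lagged information; with bootstrap=True the forest resamples these rows per tree, but no single row ever "loses" the days contained in its lag columns.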
One of the parameters in this implementation of random forests lets you set bootstrap = True/False. While tuning the hyperparameters of my model on my dataset, both random search and genetic algorithms consistently find that setting bootstrap=False produces a better model (accuracy increases by more than 1%).
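A minimal sketch of what such a random search over bootstrap (alongside a few other hyperparameters) might look like; the dataset and parameter ranges below are placeholders, not the setup described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Placeholder dataset; substitute your own features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Search space: bootstrap is just another hyperparameter to tune.
param_distributions = {
    "bootstrap": [True, False],
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", 0.5, None],
    "max_depth": [None, 5, 10, 20],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)  # on some datasets this ends up with bootstrap=False
```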
There are three general approaches for improving an existing machine learning model:
1) Use more (high-quality) data and feature engineering.
2) Tune the hyperparameters of the algorithm.
3) Try different algorithms.
I don't have the reputation to comment, so I will just post my opinion here. The scikit-learn documentation says the sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default). So if bootstrap = False, I think every sub-sample is simply the same as the original input sample.
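To illustrate the difference, here is a small NumPy sketch of the sampling idea itself; this is not scikit-learn's internal code, just the concept:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
indices = np.arange(n_samples)

# bootstrap=True: each tree gets n_samples row indices drawn WITH replacement,
# so some rows repeat and roughly a third of the rows are left out of that tree.
bootstrap_sample = rng.choice(indices, size=n_samples, replace=True)
print(sorted(bootstrap_sample))   # duplicates present, some indices missing

# bootstrap=False: no resampling; every tree is fit on the full training set.
full_sample = indices
print(sorted(full_sample))        # every row appears exactly once
```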
It seems like you're conflating the bootstrap of your observations with the sampling of your features. An Introduction to Statistical Learning provides a really good introduction to Random Forests.
The benefit of random forests comes from creating a large variety of trees by sampling both observations and features. Bootstrap = False determines whether observations are sampled with replacement: when it's False the sampling is without replacement, and since the sample size equals the training set size, that amounts to each tree using every observation exactly once.
You tell it what share of the features you want to sample at each split by setting max_features, either as a fraction of the features or as an integer count (and this is something you would typically tune to find the best value).
It's fine that, when bootstrapping, each tree won't see every day - that's where the value of a random forest comes from. Each individual tree will be a fairly weak predictor, but when you average together the predictions of hundreds or thousands of trees you'll (probably) end up with a good model.
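For reference, here is a minimal sketch of where both knobs live in scikit-learn; the values below are purely illustrative, not recommendations, and the generated data stands in for your lagged time-series features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,    # number of trees to average over
    bootstrap=False,     # no row resampling: each tree is fit on the full training set
    max_features=0.5,    # each split considers a random 50% of the features
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))
```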