What does RandomForestClassifier() do if we choose bootstrap = False?
According to the definition at this link:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
bootstrap : boolean, optional (default=True) Whether bootstrap samples are used when building trees.
I'm asking because I want to apply a Random Forest to a time series: train on a rolling window of size (t-n) and predict date (t+k). I want to know whether this is what would happen if we choose True or False:
1) If Bootstrap = True: the training samples can come from any day, each with any number of (randomly chosen) features. For example, a tree could be trained on samples from day (t-15), day (t-19), and day (t-35), each with randomly chosen features, and then predict the output of date (t+1).
2) If Bootstrap = False: it is going to use all the samples and all the features from date (t-n) to t for training, so it actually respects the date order (meaning it uses t-35, t-34, t-33, and so on up to t-1), and then predicts the output of date (t+1).
If this is how Bootstrap works, I would be inclined to use Bootstrap = False, because otherwise it would be a bit strange (think of financial series) to ignore consecutive days' returns and jump from day t-39 to t-19 and then to day t-15 to predict day t+1. We would be missing all the information between those days.
So... is this how Bootstrap works?
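To make the setup concrete, here is a rough sketch of what I mean. The lag features, window size, and randomly generated "returns" are placeholders, not my actual data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical daily return series; the real data and feature set are not shown here.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(size=500), name="ret")

# Lagged features: the row for date t holds the previous n_lags days of returns.
n_lags = 35
X = pd.concat({f"lag_{i}": returns.shift(i) for i in range(1, n_lags + 1)}, axis=1)

next_ret = returns.shift(-1)                       # return at t+1 (the prediction target)
data = pd.concat([X, next_ret.rename("next_ret")], axis=1).dropna()
y = (data["next_ret"] > 0).astype(int)             # classify: will the next day's return be positive?
features = data.drop(columns="next_ret")

# Rolling window: train on the last `window` days, predict the most recent day.
window = 250
clf = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=0)
clf.fit(features.iloc[-window - 1:-1], y.iloc[-window - 1:-1])
print(clf.predict(features.iloc[[-1]]))
```

Note that each row here is one date carrying all of its lagged information; with bootstrap=True the forest resamples these rows per tree, but no single row ever "loses" the days contained in its lag columns.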
One of the parameters in this implementation of random forests lets you set bootstrap = True/False. While tuning the hyperparameters of my model on my dataset, both random search and genetic algorithms consistently find that setting bootstrap=False produces a better model (accuracy increases by more than 1%).
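A minimal sketch of what such a random search over bootstrap (alongside a few other hyperparameters) might look like; the dataset and parameter ranges below are placeholders, not the setup described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Placeholder dataset; substitute your own features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Search space: bootstrap is just another hyperparameter to tune.
param_distributions = {
    "bootstrap": [True, False],
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", 0.5, None],
    "max_depth": [None, 5, 10, 20],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)  # on some datasets this ends up with bootstrap=False
```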
There are three general approaches for improving an existing machine learning model:
1) Use more (high-quality) data and feature engineering.
2) Tune the hyperparameters of the algorithm.
3) Try different algorithms.
I don't have the reputation to comment, so I will just post my opinion here. The scikit-learn documentation says the sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default). So if bootstrap = False, I think every sub-sample is simply the same as the original input sample.
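To illustrate the difference, here is a small NumPy sketch of the sampling idea itself; this is not scikit-learn's internal code, just the concept:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
indices = np.arange(n_samples)

# bootstrap=True: each tree gets n_samples row indices drawn WITH replacement,
# so some rows repeat and roughly a third of the rows are left out of that tree.
bootstrap_sample = rng.choice(indices, size=n_samples, replace=True)
print(sorted(bootstrap_sample))   # duplicates present, some indices missing

# bootstrap=False: no resampling; every tree is fit on the full training set.
full_sample = indices
print(sorted(full_sample))        # every row appears exactly once
```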
It seems like you're conflating the bootstrap of your observations with the sampling of your features. An Introduction to Statistical Learning provides a really good introduction to Random Forests.
The benefit of random forests comes from creating a large variety of trees by sampling both observations and features. Bootstrap = False determines whether observations are sampled with replacement: when it's False the sampling is without replacement, and since the sample size equals the training set size, that amounts to each tree using every observation exactly once.
You tell it what share of the features you want to sample at each split by setting max_features, either as a fraction of the features or as an integer count (and this is something you would typically tune to find the best value).
It's fine that, when bootstrapping, each tree won't see every day - that's where the value of a random forest comes from. Each individual tree will be a fairly weak predictor, but when you average together the predictions of hundreds or thousands of trees you'll (probably) end up with a good model.
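For reference, here is a minimal sketch of where both knobs live in scikit-learn; the values below are purely illustrative, not recommendations, and the generated data stands in for your lagged time-series features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,    # number of trees to average over
    bootstrap=False,     # no row resampling: each tree is fit on the full training set
    max_features=0.5,    # each split considers a random 50% of the features
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))
```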