
Random Forest with bootstrap = False in scikit-learn python

What does RandomForestClassifier() do if we choose bootstrap = False?

According to the definition in this link

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

bootstrap : boolean, optional (default=True) Whether bootstrap samples are used when building trees.

I am asking because I want to apply a Random Forest to a time series: train on a rolling window of size (t-n) and predict date (t+k). I want to know whether the following is what happens when we choose True or False:

1) If bootstrap = True, the training samples can come from any day and use any number of features. For example, there could be samples from day (t-15), day (t-19) and day (t-35), each with randomly chosen features, which are then used to predict the output of date (t+1).

2) If bootstrap = False, it uses all the samples and all the features from date (t-n) to t for training, so it actually respects the date order (meaning it uses t-35, t-34, t-33, ... up to t-1), and then predicts the output of date (t+1).

If this is how bootstrap works, I would be inclined to use bootstrap = False. Otherwise it would be a bit strange (think of financial series) to ignore the returns of consecutive days and jump from day t-39 to t-19 and then to day t-15 to predict day t+1. We would be missing all the information between those days.

So... is this how Bootstrap works?

Gabriel asked Oct 19 '16 12:10



2 Answers

I don't have the reputation to comment, so I will post my opinion here. The scikit-learn documentation says the sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default). So if bootstrap=False, I think every sub-sample is simply the same as the original input sample.
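One way to check this empirically is to look at how well each individual tree reproduces the full training set. The sketch below is only an illustration on synthetic data from make_classification (not a time series); the helper per_tree_train_accuracy is a hypothetical name introduced here. It assumes fully grown trees (the default settings) and no duplicated rows, in which case a tree that has seen every row should fit the training set perfectly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)


def per_tree_train_accuracy(bootstrap):
    """Fit a forest and return each tree's accuracy on the full training set."""
    forest = RandomForestClassifier(
        n_estimators=25, bootstrap=bootstrap, random_state=0
    )
    forest.fit(X, y)
    return np.array(
        [np.mean(tree.predict(X) == y) for tree in forest.estimators_]
    )


# bootstrap=False: every tree is built on all rows, so each one fits them exactly.
acc_full = per_tree_train_accuracy(bootstrap=False)

# bootstrap=True: each tree misses some rows, so it usually misclassifies a few.
acc_boot = per_tree_train_accuracy(bootstrap=True)

print("min per-tree accuracy, bootstrap=False:", acc_full.min())
print("mean per-tree accuracy, bootstrap=True:", acc_boot.mean())
```

If every tree scores perfectly on the training rows with bootstrap=False but not with bootstrap=True, that is consistent with each tree seeing the whole input sample in the first case.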

HannaMao answered Nov 09 '22 02:11


It seems like you're conflating the bootstrap of your observations with the sampling of your features. An Introduction to Statistical Learning provides a really good introduction to Random Forests.

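The benefit of random forests comes from creating a large variety of trees by sampling both observations and features. The bootstrap parameter only controls the sampling of observations: with bootstrap=True each tree is trained on a sample drawn with replacement, while with bootstrap=False the scikit-learn documentation states that the whole dataset is used to build each tree. In that case the variety between trees comes from the feature sampling alone.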
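To make the distinction between the two kinds of sampling concrete, here is a minimal conceptual sketch in plain Python. It is not scikit-learn's internal code; the sizes n_rows, n_features and max_features are made-up values for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical sizes, chosen only for illustration.
n_rows, n_features, max_features = 10, 6, 2

# Observation sampling (what bootstrap=True refers to):
# row indices drawn WITH replacement, so some rows repeat and others are left out.
row_sample = [rng.randrange(n_rows) for _ in range(n_rows)]

# Feature sampling (what max_features refers to):
# a subset of candidate features drawn WITHOUT replacement when looking for a split.
feature_subset = rng.sample(range(n_features), max_features)

print("rows used by this tree:", sorted(row_sample))
print("features considered at this split:", sorted(feature_subset))
```

The two draws are independent of each other, which is why turning off the bootstrap of observations does not turn off the random choice of features.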

You tell it how many features to sample by setting max_features, either to a share of the features (a float) or to an integer count. This is something you would typically tune to find the best value.
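As a small sketch of the two ways of setting max_features, assuming synthetic data with 10 features (the dataset and the specific values 0.5 and 3 are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 10 features (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# max_features as a share: half of the features are candidates at each split.
rf_share = RandomForestClassifier(
    n_estimators=20, max_features=0.5, random_state=0
).fit(X, y)

# max_features as an integer: exactly 3 candidate features at each split.
rf_count = RandomForestClassifier(
    n_estimators=20, max_features=3, random_state=0
).fit(X, y)

# Each fitted tree stores the resolved number of candidate features in max_features_.
share_resolved = rf_share.estimators_[0].max_features_
count_resolved = rf_count.estimators_[0].max_features_
print(share_resolved, count_resolved)
```

In practice the value would normally be chosen by cross-validation rather than fixed by hand.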

With bootstrap=True it is fine that each tree does not see every day - that's where much of the value of RF comes from. Each individual tree will be a fairly weak predictor, but when you average together the predictions from hundreds or thousands of trees you'll (probably) end up with a good model.
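The averaging step can be sketched as follows, again on synthetic data rather than a real time series (the dataset, split and n_estimators=200 are assumptions for illustration). It averages the class probabilities of the individual trees by hand and compares the result with the forest's own predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Average the per-tree class probabilities by hand.
tree_probas = np.array(
    [tree.predict_proba(X_test) for tree in forest.estimators_]
)
averaged = tree_probas.mean(axis=0)

# Compare single-tree accuracy with the accuracy of the whole forest.
tree_scores = [
    np.mean(tree.predict(X_test) == y_test) for tree in forest.estimators_
]
forest_score = forest.score(X_test, y_test)

print("mean single-tree accuracy:", np.mean(tree_scores))
print("forest accuracy:", forest_score)
```

On most datasets the forest score comes out above the mean single-tree score, although the size of the gap depends on the data.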

Tchotchke answered Nov 09 '22 02:11