Inconclusive RandomForest documentation in Scikit-Learn

In Scikit-Learn's ensemble methods documentation (http://scikit-learn.org/stable/modules/ensemble.html#id6), in section 1.9.2.3, "Parameters", we read:

(...) The best results are also usually reached when setting max_depth=None in combination with min_samples_split=1 (i.e., when fully developing the trees). Bear in mind though that these values are usually not optimal. The best parameter values should always be cross-validated.

So what is the difference between best results and optimal? I thought that by best results the author means best cross-validated prediction results.

In addition, note that bootstrap samples are used by default in random forests (bootstrap=True) while the default strategy is to use the original dataset for building extra-trees (bootstrap=False).

I understand this in the following way: bootstrapping is used by default in Scikit-Learn's implementation, but the default strategy is to not use bootstrapping. If so, what is the source of the default strategy, and why is it not the default in the implementation?

Karol Przybylak asked Oct 19 '22 17:10


1 Answer

I agree the first quote is self-contradictory. Maybe the following would be better:

The best results are also often reached with fully developed trees (max_depth=None and min_samples_split=1). Bear in mind though that these values are not guaranteed to be optimal. The best parameter values should always be cross-validated.
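To make the "should always be cross-validated" part concrete, here is a minimal sketch of comparing fully developed trees against shallower ones with a grid search. The synthetic dataset, the grid values, and the small n_estimators are assumptions chosen only for illustration. Note also that recent scikit-learn releases reject an integer min_samples_split below 2, so 2 stands in here for "fully developed".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so the sketch is self-contained (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values: fully developed trees (max_depth=None, min_samples_split=2)
# versus more constrained trees. The grid itself is illustrative.
param_grid = {
    "max_depth": [None, 3, 10],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)

# The cross-validated winner may or may not be the fully developed setting.
print(search.best_params_, search.best_score_)
```

Whether the fully developed setting wins depends on the data, which is exactly why the quote recommends checking rather than assuming.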

The second quote compares the default value of the bootstrap parameter for random forests (RandomForestClassifier and RandomForestRegressor) with its default value for extremely randomized trees, as implemented in the classes ExtraTreesClassifier and ExtraTreesRegressor. The following might be more explicit:

In addition, note that bootstrap samples are used by default in random forests (bootstrap=True) while for building extra-trees the default strategy is to use the original dataset (bootstrap=False).
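The two defaults can be checked directly by instantiating each estimator with no arguments and reading back the bootstrap parameter. This is just a quick sketch; the bootstrap_defaults dictionary is a name introduced here for illustration.

```python
from sklearn.ensemble import (
    ExtraTreesClassifier,
    ExtraTreesRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
)

estimators = [
    RandomForestClassifier,
    RandomForestRegressor,
    ExtraTreesClassifier,
    ExtraTreesRegressor,
]

# Build each estimator with default arguments and read its bootstrap setting.
bootstrap_defaults = {
    cls.__name__: cls().get_params()["bootstrap"] for cls in estimators
}

for name, value in bootstrap_defaults.items():
    print(f"{name}: bootstrap={value}")
```

So "default" in the quote refers to two different estimator families, each with its own default, not to one implementation that contradicts itself.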

Please feel free to submit a PR with the fix if you find those formulations clearer to understand.

ogrisel answered Nov 15 '22 04:11