 

Understanding max_features parameter in RandomForestRegressor

Tags:

scikit-learn

While constructing each tree in the random forest using bootstrapped samples, for each terminal node, we select m variables at random from p variables to find the best split (p is the total number of features in your data). My questions (for RandomForestRegressor) are:

1) What does max_features correspond to (m or p or something else)?

2) Are m variables selected at random from max_features variables (what is the value of m)?

3) If max_features corresponds to m, then why would I want to set it equal to p for regression (the default)? Where is the randomness with this setting (i.e., how is it different from bagging)?

Thanks.

csankar69 asked May 29 '14 17:05


People also ask

What does Max_features mean in random forest?

max_features is the maximum number of features the Random Forest is allowed to try when splitting a node in an individual tree. Python's scikit-learn offers several ways to specify it (an integer, a float fraction, or a string such as "sqrt" or "log2").

What is Max_features in decision tree?

max_features: The number of features to consider when looking for the best split. If this value is not set, the decision tree will consider all features available to make the best split.

What are the parameters of random forest?

(The parameters of a random forest are the variables and thresholds used to split each node learned during training). Scikit-Learn implements a set of sensible default hyperparameters for all models, but these are not guaranteed to be optimal for a problem.

What is a good max depth in random forest?

Generally, we go with a max depth of 3, 5, or 7.
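As a quick sketch of the max_depth point above (the dataset and parameter values here are illustrative, not from the original post), max_depth caps how deep every tree in the forest can grow:

```python
# Illustrative sketch: capping tree depth in a random forest.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Each tree in the ensemble is grown to at most depth 5.
rf = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0).fit(X, y)

depths = [est.get_depth() for est in rf.estimators_]
print(max(depths))  # no fitted tree exceeds the cap
```

Without max_depth, scikit-learn grows each tree until its leaves are pure (or until min_samples_split is reached), which is often fine for forests but can be limited for speed or regularization.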


2 Answers

Straight from the documentation:

[max_features] is the size of the random subsets of features to consider when splitting a node.

So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. The docs go on to say that

Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks

By setting max_features differently, you'll get a "true" random forest.
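A minimal sketch of this distinction (note: in recent scikit-learn versions the "auto" spelling mentioned above has been removed; max_features=None now means "consider all features", and the toy data below is just for illustration):

```python
# Sketch: m = p (bagged trees) vs. m < p (a "true" random forest).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)

# max_features=None: every split considers all 8 features,
# so the only randomness comes from the bootstrap samples (bagging).
bagged = RandomForestRegressor(n_estimators=50, max_features=None,
                               random_state=0).fit(X, y)

# max_features=3: each split considers a random subset of 3 features.
rf = RandomForestRegressor(n_estimators=50, max_features=3,
                           random_state=0).fit(X, y)

# The fitted trees record the m actually used at each split.
print(bagged.estimators_[0].max_features_)  # 8
print(rf.estimators_[0].max_features_)      # 3
```

With max_features=None the per-split subsampling disappears, so the forest degenerates into a bagged ensemble of ordinary regression trees, exactly as the answer says.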

Fred Foo answered Oct 12 '22 12:10


@lynnyi, max_features is the number of features considered at each individual split, not across the construction of the whole decision tree. To be clear: while building each decision tree, the RF can still end up using all the features (n_features), but at any single node split it only considers max_features of them, chosen at random. You can confirm this by plotting one decision tree from an RF trained with max_features=1 and counting how many distinct features appear across that tree's nodes.
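The check described above can be sketched as follows (toy data and the counting approach are mine; instead of visually inspecting a plotted tree, this reads the fitted tree's internal feature array directly):

```python
# Sketch: even with max_features=1, a single tree uses many distinct features,
# because a fresh random feature is drawn at every split.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, random_state=0)
rf = RandomForestRegressor(n_estimators=10, max_features=1,
                           random_state=0).fit(X, y)

tree = rf.estimators_[0].tree_
# tree.feature holds the split feature index per node (negative for leaves).
used = set(tree.feature[tree.feature >= 0])
print(len(used))  # several distinct features, despite m = 1 per split
```

Equivalently, sklearn.tree.plot_tree(rf.estimators_[0]) shows the same thing visually: different nodes split on different features.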

Zhendong Cao answered Oct 12 '22 12:10