While constructing each tree in the random forest using bootstrapped samples, for each terminal node, we select m variables at random from p variables to find the best split (p is the total number of features in your data). My questions (for RandomForestRegressor) are:
1) What does max_features correspond to (m or p or something else)?
2) Are m variables selected at random from max_features variables (what is the value of m)?
3) If max_features corresponds to m, then why would I want to set it equal to p for regression (the default)? Where is the randomness with this setting (i.e., how is it different from bagging)?
Thanks.
max_features: This is the maximum number of features the random forest is allowed to try at an individual split (not per tree). There are multiple options available in Python for setting the maximum features.
max_features: The number of features to consider when looking for the best split. If this value is not set, the decision tree will consider all features available to make the best split.
(The parameters of a random forest are the variables and thresholds used to split each node, and they are learned during training.) Scikit-Learn implements a set of sensible default hyperparameters for all models, but these are not guaranteed to be optimal for a given problem.
Generally, we go with a max depth of 3, 5, or 7.
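To make the above concrete, here is a minimal sketch of setting max_features (and max_depth) on a RandomForestRegressor; the dataset is synthetic and the chosen values are illustrative, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: p = 10 features.
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Each split considers at most 3 of the 10 features, chosen at random;
# max_depth=5 limits tree depth as discussed above.
rf = RandomForestRegressor(n_estimators=100, max_features=3,
                           max_depth=5, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # R^2 on the training data
```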
Straight from the documentation:
max_features is the size of the random subsets of features to consider when splitting a node.

So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. The docs go on to say:

Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks

By setting max_features differently, you'll get a "true" random forest.
@lynnyi, max_features is the number of features considered at each split, not for the construction of the entire decision tree. To be more clear: during the construction of each decision tree, RF will still use all the features (n_features), but it only considers max_features of them when splitting a node, and those max_features features are randomly selected from the entire feature set. You can confirm this by plotting one decision tree from an RF with max_features=1 and checking all the nodes of that tree to count the number of features involved.
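The check described above can be sketched without plotting by reading the fitted tree's internals (tree_.feature holds the feature index used at each split node, and a negative value marks a leaf):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with p = 8 features.
X, y = make_regression(n_samples=200, n_features=8, random_state=0)

# max_features=1: a single random feature is considered at each split.
rf = RandomForestRegressor(n_estimators=10, max_features=1, random_state=0)
rf.fit(X, y)

tree = rf.estimators_[0].tree_
# tree.feature is negative for leaves; keep only internal (split) nodes.
used = np.unique(tree.feature[tree.feature >= 0])
print(f"Distinct features used for splits in tree 0: {used}")
```

Even with max_features=1, a single tree typically ends up using many different features overall, because a fresh random feature is drawn independently at every split.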