I would like to learn more about the Random Forest Regressors I am building with sklearn. For example, what depth do the trees reach on average if I do not regularise?
The reason is that I need to regularise the model and want to get a feeling for what it currently looks like. Also, if I set e.g. max_leaf_nodes, is it still necessary to also restrict max_depth, or does this "problem" sort of solve itself because the tree cannot be grown too deep if max_leaf_nodes is set? Does this make sense, or am I thinking in the wrong direction? I could not find anything on this.
If you want to know the average maximum depth of the trees constituting your Random Forest model, you have to access each tree individually, query its maximum depth, and then compute a statistic from the results you obtain.
Let's first build a reproducible example of a Random Forest classifier model (taken from the Scikit-learn documentation):
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100,
                             random_state=0)
clf.fit(X, y)
Now we can iterate over its estimators_ attribute, which contains each decision tree. For each tree, we query the tree_.max_depth attribute, store the result, and take an average after completing the iteration:
max_depth = list()
for tree in clf.estimators_:
    max_depth.append(tree.tree_.max_depth)
print("avg max depth %0.1f" % (sum(max_depth) / len(max_depth)))
This will give you an idea of the average maximum depth of the trees composing your Random Forest model (it works exactly the same for a regressor model, as you asked about).
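As a side note, each fitted decision tree also exposes a get_depth() method in current scikit-learn versions, so the same average can be computed without touching tree_ directly. A minimal sketch, reusing the classifier example from above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# get_depth() returns the same value as tree_.max_depth for each estimator
depths = [est.get_depth() for est in clf.estimators_]
avg_depth = sum(depths) / len(depths)
print("avg max depth %0.1f" % avg_depth)
```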
Anyway, as a suggestion, if you want to regularize your model, you had better test parameter hypotheses under a cross-validation and grid/random search paradigm. In that context you don't need to ask yourself how hyperparameters interact with each other: you simply test different combinations and pick the best one based on the cross-validation score.
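To sketch that paradigm, here is a minimal GridSearchCV example searching over max_depth and max_leaf_nodes together for a regressor, which directly addresses your question about their interaction. The dataset and grid values below are illustrative assumptions, not recommendations:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# illustrative grid: None means the parameter is left unrestricted
param_grid = {
    "max_depth": [None, 5, 10],
    "max_leaf_nodes": [None, 16, 64],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

The search evaluates every combination, so you never have to reason about whether restricting one parameter makes the other redundant; the cross-validation score decides.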
In addition to @Luca Massaron's answer:
I found https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py which can be applied to each tree in the forest using
for tree in clf.estimators_:
The number of leaf nodes can be calculated like this:
import numpy as np

n_trees = len(clf.estimators_)
n_leaves = np.zeros(n_trees, dtype=int)
for i in range(n_trees):
    n_nodes = clf.estimators_[i].tree_.node_count
    # leaves have no children, so checking children_left (or children_right) works
    children_left = clf.estimators_[i].tree_.children_left
    for x in range(n_nodes):
        if children_left[x] == -1:
            n_leaves[i] += 1
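As a cross-check, recent scikit-learn versions expose the leaf count directly as tree_.n_leaves, so the manual loop can be verified against it. A sketch, refitting the same example model from the first answer:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# tree_.n_leaves gives the leaf count per tree without any manual counting
n_leaves = np.array([est.tree_.n_leaves for est in clf.estimators_])
print("avg leaves %0.1f" % n_leaves.mean())
```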