While using the RandomForestRegressor I noticed something strange. To illustrate the problem, here is a small example: I applied the RandomForestRegressor to a test dataset and plotted the graph of the first tree in the forest. This gives me the following output:
Root_node:
mse=8.64
samples=2
value=20.4
Left_leaf:
mse=0
samples=1
value=24
Right_leaf:
mse=0
samples=1
value=18
First, I expected the root node to have a value of (24+18)/2 = 21, but somehow it is 20.4.
However, even if this value is correct, how do I get an mse of 8.64? From my point of view it is supposed to be 1/2[(24-20.4)^2 + (18-20.4)^2] = 9.36 (under the assumption that the root value of 20.4 is correct). With the root value of 21 that I expected, I get 1/2[(24-21)^2 + (18-21)^2] = 9, which is also what I get if I just use the DecisionTreeRegressor.
Is there something wrong in the implementation of the RandomForestRegressor or am I completely wrong?
Here is my reproducible code:
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import graphviz

# create example dataset
data = {'AGE': [91, 42, 29, 94, 85],
        'TAX': [384, 223, 280, 666, 384],
        'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)

x = df[['AGE', 'TAX']]
y = df['Y']  # 1-D target, avoids a DataConversionWarning in fit()

rf_reg = RandomForestRegressor(max_depth=2, random_state=1)
rf_reg.fit(x, y)

# plot a single tree of the forest (here: the first estimator)
dot_data = tree.export_graphviz(rf_reg.estimators_[0], out_file=None,
                                feature_names=x.columns)
graph = graphviz.Source(dot_data)
graph  # renders inline in a Jupyter notebook
and the output graph (its node values are the ones transcribed at the top of this question).
tl;dr
It is due to the bootstrap sampling.
In detail:
With the default setting bootstrap=True, RF will use bootstrap sampling when building the individual trees; quoting from the Cross Validated thread Number of Samples per-Tree in a Random Forest:
If bootstrap=True, then for each tree, N samples are drawn randomly with replacement from the training set and the tree is built on this new version of the training data. This introduces randomness in the training procedure since trees will each be trained on slightly different training sets. In expectation, drawing N samples with replacement from a dataset of size N will select ~2/3 unique samples from the original set.
"With replacement" means that some samples may be chosen more than once, while others will be left out, with the total number of chosen samples remaining equal to the number of samples of the original dataset (here 5).
What actually has happened in the tree you show is that, despite Graphviz displaying samples=2, this should be understood as the number of unique samples; there are in total 5 (bootstrap) samples in the root node: 2 copies of the sample with y=24 and 3 copies of the one with y=18 (recall that, by the definition of the bootstrap sampling procedure, the root node here must contain 5 samples, neither more nor less).
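You can verify which rows the first tree actually saw by regenerating its bootstrap indices. A caveat: _generate_sample_indices is a private scikit-learn helper (it lives in sklearn.ensemble._forest in recent versions, and its location/signature may change between releases), so treat this as a sketch rather than supported API:

import numpy as np
from sklearn.ensemble._forest import _generate_sample_indices  # private API!

first_tree = rf_reg.estimators_[0]

# regenerate the 5 bootstrap draws used when fitting the first tree
idx = _generate_sample_indices(first_tree.random_state,
                               n_samples=len(x),
                               n_samples_bootstrap=len(x))

print(sorted(y.values[idx]))  # should show 2 copies of 24 and 3 copies of 18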
Now the displayed values add up:
# value:
(2*24 + 3*18)/5
# 20.4
# mse:
(2*(24-20.4)**2 + 3*(18-20.4)**2)/5
# 8.64
There seems to be some design choice here, either in the Graphviz visualization or in the underlying DecisionTreeRegressor, so that only the number of unique samples is stored/displayed; as far as I can tell, scikit-learn implements the bootstrap by passing integer sample weights to each tree, and rows with zero weight simply do not count toward the displayed samples. This may (or may not) be a reason for opening a Github issue, but this is how the situation is for now (to be honest, I am not sure myself that I would want the actual total number of samples displayed here, including the duplicates due to bootstrap sampling).
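If you want to reproduce this display behavior without the RF machinery, here is an illustrative experiment (my own sketch, under the sample-weight assumption just described): fit a plain DecisionTreeRegressor with integer sample weights mimicking the bootstrap counts above; the zero-weight rows then disappear from the displayed counts, and the root should show the very same samples=2, value=20.4, mse=8.64:

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
import graphviz

# weights mimic the bootstrap counts: 2 copies of the y=24 row (index 2),
# 3 copies of the y=18 row (index 4), zero weight for everything else
dt = DecisionTreeRegressor(max_depth=2, random_state=1)
dt.fit(x, y, sample_weight=[0, 0, 2, 0, 3])

dot_data = tree.export_graphviz(dt, out_file=None, feature_names=x.columns)
graphviz.Source(dot_data)  # root: samples=2, value=20.4, mse=8.64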
The situation is similar with RF & Bagging classifier models.