I'm curious what the value field is in the nodes of a decision tree produced with Graphviz when the tree is used for regression. I understand that for decision tree classification this is the number of samples of each class that are separated by a split, but I'm not sure what it means for regression.
My data has a 2-dimensional input and a 10-dimensional output. Here is an example of what a tree looks like for my regression problem:
produced using this code and visualized with webgraphviz:
# X = (n x 2), Y = (n x 10), X_test = (m x 2)
import pickle

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# note: criterion='mse' is spelled 'squared_error' in newer scikit-learn versions
input_scaler = pickle.load(open("../input_scaler.sav", "rb"))
reg = DecisionTreeRegressor(criterion='mse', max_depth=2)
reg.fit(X, Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    tree.export_graphviz(reg, out_file=f)
Interpretation. The interpretation is simple: starting from the root node, you move to the next nodes, and the edges tell you which subsets of the data you are looking at. Once you reach a leaf node, that node tells you the predicted outcome.
A decision tree is a flowchart-like tree structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values.
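To see this traversal concretely, here is a minimal sketch, assuming the reg and X_test from the snippet above; apply and decision_path are standard scikit-learn tree methods:

leaf_ids = reg.apply(X_test)            # (m,) id of the leaf each sample lands in
path = reg.decision_path(X_test[:1])    # sparse (1 x n_nodes) indicator matrix
print("sample 0 visits nodes:", path.indices)   # node ids from root to leaf
print("and lands in leaf:", leaf_ids[0])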
What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.
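One way to check this numerically (a sketch, assuming the reg, X, and Y from the question) is to group the training samples by the leaf they fall into and compare the per-leaf mean of Y with the stored value list:

import numpy as np

leaves = reg.apply(X)                        # leaf id for each training sample
for leaf in np.unique(leaves):
    mean_y = Y[leaves == leaf].mean(axis=0)  # empirical mean of Y, length 10
    stored = reg.tree_.value[leaf].ravel()   # the value list of this leaf
    assert np.allclose(mean_y, stored)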
In other words, and using the leftmost terminal node (leaf) of your tree as an example:

- the leaf contains the training samples for which X[0] <= 0.675 and X[1] <= 0.5 (42 of them, as used in the weighted-average check below)
- the mean value of your 10-dimensional Y over these samples is given in the value list of this leaf, which is indeed of length 10: the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, etc., and the mean of Y[9] is 3211.487.
You can confirm that this is the case by predicting some samples (from your training or test set, it doesn't matter) and checking that each 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.
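A sketch of this check, under the same assumptions as above; since the tree has depth 2, there are at most 4 distinct predictions:

import numpy as np

pred = reg.predict(X_test)            # (m, 10); each row is a leaf value list
distinct = np.unique(pred, axis=0)    # collapse duplicate rows
print(distinct.shape[0], "distinct predictions (at most 4 expected)")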
Additionally, you can confirm that, for each element of value, the weighted average over the children nodes is equal to the respective element of the parent node. Again, using the first elements of your 2 leftmost terminal nodes (leaves), we get:
(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858
i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:
(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822
which again agrees with the -0.0 first value element of your root node.
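The same consistency check can be written once for every internal node; this is a sketch using the public reg.tree_ arrays (children_left, children_right, n_node_samples, value), again assuming the fitted reg from the question:

import numpy as np

t = reg.tree_
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                     # -1 marks a leaf: nothing to check
        continue
    n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
    weighted = (n_l * t.value[left] + n_r * t.value[right]) / (n_l + n_r)
    assert np.allclose(weighted, t.value[node])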
Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
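That final confirmation is a one-liner (same assumptions as above): node 0 is always the root, so its value list should equal the column-wise mean of Y:

import numpy as np

assert np.allclose(reg.tree_.value[0].ravel(), Y.mean(axis=0))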
So, to wrap up:

- the value list of each node contains the mean Y values for the training samples "belonging" to the respective node
- the value list of the root node contains the mean Y values for the whole of your training dataset