Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

interpreting Graphviz output for decision tree regression

I'm curious what the value field is in the nodes of the decision tree produced by Graphviz when used for regression. I understand that this is the number of samples in each class that are separated by a split when using decision tree classification but I'm not sure what it means for regression.

My data has a 2 dimensional input and a 10 dimensional output. Here is an example of what a tree looks like for my regression problem:

enter image description here

produced using this code & visualized with webgraphviz

# X = (n x 2)  Y = (n x 10)  X_test = (m x 2)

input_scaler = pickle.load(open("../input_scaler.sav","rb"))
reg = DecisionTreeRegressor(criterion = 'mse', max_depth = 2)
reg.fit(X,Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    f = tree.export_graphviz(reg, out_file=f)
like image 941
Renee Swischuk Avatar asked Jan 03 '18 21:01

Renee Swischuk


People also ask

How do you interpret the results of a decision tree?

1 Interpretation. The interpretation is simple: Starting from the root node, you go to the next nodes and the edges tell you which subsets you are looking at. Once you reach the leaf node, the node tells you the predicted outcome.

How do you interpret a decision tree in Python?

A decision tree is a flowchart-like tree structure where an internal node represents feature(or attribute), the branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. It learns to partition on the basis of the attribute value.


1 Answers

What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.

In other words, and using the leftmost terminal node (leaf) of your tree as an example:

  • The leaf consists of the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
  • The mean value of your 10-dimensional output for these 42 samples is given in the value list of this leave, which is of length 10 indeed, i.e. the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675 etc and the mean of Y[9] is 3211.487.

You can confirm that this is the case by predicting some samples (from your training or test set - it doesn't matter) and checking that your 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.

Additionally, you can confirm that, for each element in value, the weighted averages of the children nodes are equal to the respective element of the parent node. Again, using the first element of your 2 leftmost terminal nodes (leaves), we get:

(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858

i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:

(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822

which again agrees with the -0.0 first value element of your root node.

Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.


So, to wrap-up:

  • The value list of each node contains the mean Y values for the training samples "belonging" to the respective node
  • Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e. the output will always be one of these lists, depending on X)
  • For the root node, the value list contains the mean Y values for the whole of your training dataset
like image 52
desertnaut Avatar answered Oct 17 '22 14:10

desertnaut