I'm curious what the value field is in the nodes of a decision tree produced with Graphviz when the tree is used for regression. I understand that for decision tree classification this is the number of samples of each class that are separated by a split, but I'm not sure what it means for regression.
My data has a 2-dimensional input and a 10-dimensional output. Here is an example of what a tree looks like for my regression problem:
produced using this code and visualized with webgraphviz:
# X = (n x 2), Y = (n x 10), X_test = (m x 2)
import pickle

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# note: criterion='mse' is spelled 'squared_error' in newer scikit-learn versions
input_scaler = pickle.load(open("../input_scaler.sav", "rb"))
reg = DecisionTreeRegressor(criterion='mse', max_depth=2)
reg.fit(X, Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    tree.export_graphviz(reg, out_file=f)
Interpretation. The interpretation is simple: starting from the root node, you move to the next nodes, and the edges tell you which subsets of the data you are looking at. Once you reach a leaf node, that node tells you the predicted outcome.
A decision tree is a flowchart-like tree structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values.
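To see this traversal concretely, here is a minimal sketch, assuming the reg and X_test from the snippet above; apply and decision_path are standard scikit-learn tree methods:

leaf_ids = reg.apply(X_test)            # (m,) id of the leaf each sample lands in
path = reg.decision_path(X_test[:1])    # sparse (1 x n_nodes) indicator matrix
print("sample 0 visits nodes:", path.indices)   # node ids from root to leaf
print("and lands in leaf:", leaf_ids[0])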
What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.
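One way to check this numerically (a sketch, assuming the reg, X, and Y from the question) is to group the training samples by the leaf they fall into and compare the per-leaf mean of Y with the stored value list:

import numpy as np

leaves = reg.apply(X)                        # leaf id for each training sample
for leaf in np.unique(leaves):
    mean_y = Y[leaves == leaf].mean(axis=0)  # empirical mean of Y, length 10
    stored = reg.tree_.value[leaf].ravel()   # the value list of this leaf
    assert np.allclose(mean_y, stored)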
In other words, and using the leftmost terminal node (leaf) of your tree as an example:

- the leaf contains the training samples for which X[0] <= 0.675 and X[1] <= 0.5 (42 of them, as used in the weighted-average check below)
- the mean value of your 10-dimensional Y over these samples is given in the value list of this leaf, which is indeed of length 10: the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, etc., and the mean of Y[9] is 3211.487.
You can confirm that this is the case by predicting some samples (from your training or test set, it doesn't matter) and checking that each 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.
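A sketch of this check, under the same assumptions as above; since the tree has depth 2, there are at most 4 distinct predictions:

import numpy as np

pred = reg.predict(X_test)            # (m, 10); each row is a leaf value list
distinct = np.unique(pred, axis=0)    # collapse duplicate rows
print(distinct.shape[0], "distinct predictions (at most 4 expected)")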
Additionally, you can confirm that, for each element of value, the weighted average over the children nodes is equal to the respective element of the parent node. Again, using the first elements of your 2 leftmost terminal nodes (leaves), we get:
(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858
i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:
(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822
which again agrees with the -0.0 first value element of your root node.
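The same consistency check can be written once for every internal node; this is a sketch using the public reg.tree_ arrays (children_left, children_right, n_node_samples, value), again assuming the fitted reg from the question:

import numpy as np

t = reg.tree_
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                     # -1 marks a leaf: nothing to check
        continue
    n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
    weighted = (n_l * t.value[left] + n_r * t.value[right]) / (n_l + n_r)
    assert np.allclose(weighted, t.value[node])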
Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
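That final confirmation is a one-liner (same assumptions as above): node 0 is always the root, so its value list should equal the column-wise mean of Y:

import numpy as np

assert np.allclose(reg.tree_.value[0].ravel(), Y.mean(axis=0))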
So, to wrap up:

- the value list of each node contains the mean Y values for the training samples "belonging" to the respective node
- the value list of the root node contains the mean Y values for the whole of your training dataset