 

What do the values that `graphviz` renders inside each node of a decision tree mean?

[Image: graphviz rendering of the first sub-tree of the AdaBoost ensemble, showing a root node and two leaves]

For the image above, using the AdaBoostClassifier class from scikit-learn and graphviz, I was able to create this sub-tree visual, and I need help interpreting the values in each node. For example, what does "gini" mean? What is the significance of the "samples" and "value" fields? What does it mean that attribute F5 <= 0.5?

Here is my code (I ran all of this in a Jupyter notebook):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

d = dict()
for i in range(1, 9):
    d['F' + str(i)] = []
d['RES'] = []
with open('dtree-data.txt') as f:  # close the file automatically
    for line in f:
        fields = line.split()
        values = [x == 'True' for x in fields[:8]]
        d['RES'].append(fields[8])
        for i in range(1, 9):
            d['F' + str(i)].append(values[i - 1])
df = pd.DataFrame(data=d, columns=['F1','F2','F3','F4','F5','F6','F7','F8','RES'])

from sklearn.model_selection import train_test_split

X = df.drop('RES', axis=1)
y = df['RES']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)

from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn versions
from sklearn.tree import export_graphviz
import pydot

# https://stackoverflow.com/questions/46192063/not-fitted-error-when-using-sklearns-graphviz 

sub_tree = ada.estimators_[0]
dot_data = StringIO()
features = list(df.columns[:-1])  # F1..F8; df.columns[1:] would wrongly drop F1 and include RES
export_graphviz(sub_tree, out_file=dot_data, feature_names=features, filled=True, rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())

NOTE: External packages (graphviz, pydot) may need to be installed in order to render the image locally.

Here is a link to the data file: https://cs.rit.edu/~jro/courses/intelSys/dtree-data

Q.H. asked Nov 27 '17



1 Answer

A decision tree is a binary tree where each node represents a portion of the data. Each node that is not a leaf (the root or an internal branch) splits its portion of the data into two sub-parts. The root node contains all the data (from the training set). Furthermore, this is a classification tree: it predicts class probabilities, and those probabilities are the node values.
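To make the link between node values and predicted probabilities concrete, here is a minimal sketch on a hypothetical toy dataset (not the asker's data): `predict_proba` just normalizes the class weights of the leaf that a sample lands in.

```python
# Minimal sketch with a hypothetical toy dataset: a depth-1 classification
# tree whose predict_proba returns the normalized class weights of the
# leaf that the input sample falls into.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [0], [0], [1], [1]]
y = ['A', 'A', 'B', 'B', 'B']
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# The left leaf (x <= 0.5) holds two 'A's and one 'B', so a sample with
# x = 0 gets probabilities [2/3, 1/3] for the classes ['A', 'B'].
print(clf.predict_proba([[0]]))
```

(The `tree_.value` arrays behind this hold raw class counts in older scikit-learn releases and within-node class fractions in newer ones, but `predict_proba` normalizes either way.)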

Root/branch node:

  • samples = 134 means the node 'contains' 134 samples. Since it is the root node, that means the tree was trained on 134 samples.
  • value = [0.373, 0.627] are the class frequencies. About 1/3 of the samples belong to class A and 2/3 to class B.
  • gini = 0.468 is the Gini impurity of the node. It describes how mixed up the classes are: 0 means the node is pure (a single class), and 0.5 is the maximum for two classes.
  • F5 <= 0.5: F5 is one of the feature columns of the data. The node is split so that all samples where F5 is lower than or equal to 0.5 go to the left child and all samples where it is higher go to the right child. Since the features here are booleans (0/1), this is effectively F5 == False to the left and F5 == True to the right.
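The gini number can be checked by hand from the value field: for two classes it is 1 - (p_A² + p_B²). A quick sanity check using the root's numbers from the image (which already sum to 1):

```python
# Gini impurity of the root node, computed from its value field.
# [0.373, 0.627] already sums to 1, so the entries can be used
# directly as class proportions.
value = [0.373, 0.627]
gini = 1 - sum(p ** 2 for p in value)
print(round(gini, 3))  # 0.468, matching the root node in the image
```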

Leaf nodes:

  • These nodes are not split further, so there is no need for an F <= something field.
  • samples = 90 / 44 sum to 134: 90 samples went to the left child and 44 samples to the right child.
  • values = [0.104, 0.567] / [0.269, 0.06] are the class frequencies in the children, expressed as fractions of the whole training set. Most samples in the left child belong to class B (56% vs 10%) and most samples in the right child belong to class A (27% vs 6%).
  • gini = 0.263 / 0.298 are the remaining impurities in the child nodes. They are lower than in the parent node, which means the split improved separability between the classes, but some uncertainty remains.
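The same hand check works for the leaves, except that their value fields are fractions of the whole training set and must first be normalized to proportions within the node. A small sketch using the numbers from the image:

```python
def gini_from_value(value):
    # Normalize the node's value field to within-node class proportions,
    # then apply the Gini formula 1 - sum(p^2).
    total = sum(value)
    return 1 - sum((v / total) ** 2 for v in value)

print(round(gini_from_value([0.104, 0.567]), 3))  # 0.262 (the image shows
# 0.263; the displayed value fields are themselves rounded)
print(round(gini_from_value([0.269, 0.06]), 3))   # 0.298, matching the image
```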
MB-F answered Sep 30 '22