I am a beginner in machine learning and am experimenting with decision trees. I am looking at this visualization of a decision tree http://scikit-learn.org/dev/_images/iris.svg and wondering what the error value signifies. Is it the Gini index, the information gain, or something else? I would also appreciate an explanation of what it intuitively means.
In this concrete example, the "error" of a node is the Gini Index of all examples that reached that node.
In general, the "error" of a node depends on the chosen impurity criterion (e.g. Gini or entropy for classification, and mean squared error for regression).
Intuitively, you can think of both impurity criteria (Gini and entropy) as a measure of how homogeneous a multiset is. A multiset is homogeneous if it contains mostly elements of one type (such a set is also called "pure", hence the name "impurity criterion"). In our case, the elements of the multiset are the class labels of the examples that reach the corresponding node. When we split a node, we want the resulting partitions to be pure, meaning that the classes are well separated (each partition contains mostly instances of one class).
In the case of criterion="entropy" and binary classification, an error of 1.0 means that the node contains an equal number of positive and negative examples (the most inhomogeneous multiset possible).
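To make this concrete, here is a small sketch that computes both impurity measures directly from a multiset of class labels (the function names `gini` and `entropy` are just illustrative, not part of scikit-learn's API):

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(gini(["a"] * 10))         # a pure node -> 0.0
print(gini(["a", "b"] * 5))     # balanced binary node -> 0.5
print(entropy(["a", "b"] * 5))  # balanced binary node -> 1.0
```

Note how a pure node has impurity 0 under either criterion, and a perfectly balanced binary node has an entropy of exactly 1.0, matching the statement above.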
You can access the tree data structure that underlies a DecisionTreeClassifier or DecisionTreeRegressor via its tree_ attribute, which holds an object of the extension type sklearn.tree._tree.Tree. This object represents the tree as a set of parallel numpy arrays. The array init_error holds the initial error of each node; best_error holds the sum of the errors of the two partitions if the node is a splitting node.
See the class documentation at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L45 for more details.
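As a minimal sketch of inspecting these arrays: note that the attribute names have changed across scikit-learn versions, and in recent releases the per-node impurity is exposed as tree_.impurity rather than init_error/best_error. Assuming a recent version:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

tree = clf.tree_  # sklearn.tree._tree.Tree: parallel numpy arrays
print(tree.node_count)          # total number of nodes in the tree
print(tree.impurity[0])         # impurity of the root node (all 150 samples)
print(tree.children_left[:5])   # child indices; -1 marks a leaf
```

For the iris data the root node sees 50 examples of each of the three classes, so its Gini impurity is 1 - 3 * (1/3)^2 = 2/3 ≈ 0.667.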