
Why is my decision tree creating a split that doesn't actually divide the samples?

Here's my basic code for two-feature classification of the well-known Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
iris_limited = iris.data[:, [2, 3]] # This gets only petal length & width.

# I'm using the max depth as a way to avoid overfitting
# and simplify the tree since I'm using it for educational purposes
clf = DecisionTreeClassifier(criterion="gini",
                             max_depth=3,
                             random_state=42)

clf.fit(iris_limited, iris.target)

visualization_raw = export_graphviz(clf, 
                                    out_file=None,
                                    special_characters=True,
                                    feature_names=["length", "width"],
                                    class_names=iris.target_names,
                                    node_ids=True)

visualization_source = Source(visualization_raw)
visualization_png_bytes = visualization_source.pipe(format='png')
with open('my_file.png', 'wb') as f:
    f.write(visualization_png_bytes)

When I checked the visualization of my tree, I found this:

[Tree visualization result]

This is a fairly normal tree at first glance, but I noticed something odd about it. Node #6 has 46 samples in total, only one of which is versicolor, so the node is marked as virginica. That seems like a reasonable place to stop. However, for some reason I can't understand, the algorithm decides to split it further into nodes #7 and #8. The odd thing is that the one versicolor sample in there still gets misclassified, since both child nodes end up with the class virginica anyway. Why is it doing this? Does it blindly look only at the Gini decrease, without checking whether the split makes a difference at all? That seems like odd behaviour to me, and I can't find it documented anywhere.

Is it possible to disable this, or is it in fact correct?

asked Apr 28 '18 by naiveai

1 Answer

Why is it doing this?

Because the split gives more information about the class of the samples. You are right that it does not change the outcome of any prediction, but the model is more certain afterwards. Consider the samples in node #8: before the split, the model is about 98% confident that these samples are virginica (45 of 46 samples). After the split, the model says that the samples reaching that leaf are virginica for sure.
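You can verify this directly on the fitted tree. Here's a minimal sketch, assuming the clf fitted in your question and the node ids shown by node_ids=True; sklearn exposes the per-node sample counts, impurities and class distributions via clf.tree_:

import numpy as np

t = clf.tree_  # low-level structure of the fitted tree
for node in (6, 7, 8):  # node ids as labelled in the visualization
    dist = t.value[node][0]  # per-class distribution at this node
    pred = iris.target_names[np.argmax(dist)]
    print(f"node #{node}: n_samples={t.n_node_samples[node]}, "
          f"gini={t.impurity[node]:.4f}, predicted={pred}")

All three nodes predict virginica, but node #8's Gini impurity drops to 0.0 after the split.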

Does it blindly look only at the Gini decrease, without checking whether the split makes a difference at all?

By default, DecisionTreeClassifier continues splitting until all leaf nodes are pure (or until a stopping condition such as max_depth is reached). Several parameters affect the splitting behaviour, but the splitter never explicitly considers whether splitting a node makes a difference to the predicted label; it only looks for the largest impurity decrease.
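To see why the splitter still takes the node #6 split, you can redo the Gini arithmetic by hand. A small sketch, with the class counts [0, 1, 45], [0, 1, 2] and [0, 0, 43] read off the visualization in the question:

def gini(counts):
    # Gini impurity of a node with the given per-class sample counts
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [0, 1, 45]  # node #6: 1 versicolor, 45 virginica
left = [0, 1, 2]     # node #7
right = [0, 0, 43]   # node #8

n = sum(parent)
children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(f"gini(node #6)       = {gini(parent):.4f}")   # ~0.0425
print(f"weighted child gini = {children:.4f}")       # ~0.0290
print(f"impurity decrease   = {gini(parent) - children:.4f}")  # > 0

The decrease is small but strictly positive, so under the default settings the splitter accepts it.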

Is it possible to disable this, or is it in fact correct?

I don't think there is a way to force DecisionTreeClassifier not to split when the split produces two leaf nodes with the same label. However, by carefully setting the min_samples_leaf and/or min_impurity_decrease parameters, you can achieve something similar.
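For example, min_impurity_decrease rejects any split whose weighted impurity decrease falls below the threshold. A sketch, using an illustrative threshold rather than a recommendation: sklearn scales the per-node decrease computed above by the fraction of training samples reaching the node (46/150 * 0.0135 ≈ 0.004 here), so on this dataset a threshold of 0.01 should block the node #6 split while keeping the earlier, more informative ones:

clf_pruned = DecisionTreeClassifier(criterion="gini",
                                    max_depth=3,
                                    min_impurity_decrease=0.01,  # illustrative value; tune for your data
                                    random_state=42)
clf_pruned.fit(iris_limited, iris.target)
# Node #6 now remains a leaf predicting virginica; the predictions themselves are unchanged.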

answered Sep 19 '22 by Seljuk Gülcan