Here's my basic code for two-feature classification of the well-known Iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
iris_limited = iris.data[:, [2, 3]]  # This gets only petal length & width.

# I'm using the max depth as a way to avoid overfitting
# and simplify the tree since I'm using it for educational purposes.
clf = DecisionTreeClassifier(criterion="gini",
                             max_depth=3,
                             random_state=42)
clf.fit(iris_limited, iris.target)

visualization_raw = export_graphviz(clf,
                                    out_file=None,
                                    special_characters=True,
                                    feature_names=["length", "width"],
                                    class_names=iris.target_names,
                                    node_ids=True)
visualization_source = Source(visualization_raw)
visualization_png_bytes = visualization_source.pipe(format='png')
with open('my_file.png', 'wb') as f:
    f.write(visualization_png_bytes)
When I checked the visualization of my tree, I found this:
This is a fairly normal tree at first glance, but I noticed something odd about it. Node #6 has 46 samples in total, only one of which is versicolor, so the node is labelled virginica. This seems like a fairly reasonable place to stop. However, for some reason I can't understand, the algorithm decides to split it further into nodes #7 and #8. The odd thing is, the one versicolor sample in there still gets misclassified, since both child nodes end up with the class virginica anyway. Why is it doing this? Does it blindly look only at the Gini decrease, without checking whether the split makes any difference to the predictions? That seems like odd behaviour to me, and I can't find it documented anywhere.
Is it possible to disable, or is this in fact correct?
Why is it doing this?
Because it gives more information about the class of the samples. You are right that this split does not change the outcome of any prediction, but the model is now more certain. Consider the samples in node #8: before the split, the model was about 98% confident that these samples were virginica. After the split, the model says these samples are virginica for sure.
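You can see this by comparing predict_proba before and after the extra split. Here is a minimal sketch, reproducing the question's setup with depth 2 (where node #6 is still a leaf) versus depth 3; the test point (6.0, 2.0) is just an assumed example that lies well inside the node #8 region:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length & width, as in the question

# The depth-2 tree stops at the 46-sample node; depth 3 performs the extra split.
shallow = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, iris.target)
deep = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, iris.target)

# A hypothetical point deep in virginica territory (node #8's region).
point = [[6.0, 2.0]]
print(shallow.predict_proba(point))  # roughly [0, 0.02, 0.98] - the 1-in-46 versicolor
print(deep.predict_proba(point))     # the leaf is now pure virginica
```

Both trees predict virginica for this point; only the reported probability changes.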
Does it blindly look at only the Gini decrease without looking at whether it makes a difference at all
By default, DecisionTreeClassifier continues splitting until all leaf nodes are pure. There are several parameters that affect the splitting behaviour (max_depth, min_samples_leaf, min_impurity_decrease, etc.). However, it does not explicitly consider whether splitting a node makes a difference in the predicted label.
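As a quick sketch of this default behaviour (assuming the same two petal features as the question), a tree with no depth limit keeps splitting well past depth 3, even though the extra splits barely change any predictions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]

limited = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, iris.target)
# All defaults: splitting continues until every leaf is pure
# (or cannot be split further, e.g. duplicate points with different labels).
unlimited = DecisionTreeClassifier(random_state=42).fit(X, iris.target)

print(limited.get_depth(), limited.tree_.node_count)
print(unlimited.get_depth(), unlimited.tree_.node_count)  # deeper, more nodes
```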
Is it possible to disable, or is this in fact correct?
I don't think there is a way to force DecisionTreeClassifier not to split when the split would produce two leaf nodes with the same label. However, by carefully setting the min_samples_leaf and/or min_impurity_decrease parameters, you can achieve a similar effect.
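For example, min_impurity_decrease rejects any split whose weighted impurity decrease falls below a threshold. The sketch below uses 0.01, an assumed value chosen to be above the tiny decrease of the node #6 split (46 samples, only a slight Gini drop) but below that of the other splits in this tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]

base = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, iris.target)

# Require a weighted impurity decrease of at least 0.01 for any split;
# the near-pure node's split falls below this bar and is skipped.
pruned = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01,
                                random_state=42).fit(X, iris.target)

print(base.tree_.node_count, pruned.tree_.node_count)  # pruned tree has fewer nodes
# The skipped split never affected any label, so predictions are unchanged.
print((base.predict(X) == pruned.predict(X)).all())
```

Note that this is a global threshold, so too large a value will also prune splits you want to keep; min_samples_leaf behaves differently, since it can make the tree pick another split point rather than not splitting at all.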