I observed that scikit-learn's clf.tree_.feature occasionally returns negative values, for example -2. As far as I understand, clf.tree_.feature is supposed to return the sequential indices of the features. If we have an array of feature names
['feature_one', 'feature_two', 'feature_three']
, then -2 would refer to feature_two. I am surprised by the use of a negative index; it would make more sense to refer to feature_two by index 1 (-2 is a reference convenient for humans, not for machine processing). Am I reading this correctly?
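For reference, here is the ordinary Python negative-indexing behaviour that my reading assumes (the feature names are invented for illustration):
features = ['feature_one', 'feature_two', 'feature_three']

# Python negative indices count from the end of the list,
# so -2 points at the second-to-last element:
print(features[-2])  # feature_two
print(features[1])   # feature_two, the equivalent non-negative index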
Update: Here is an example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_ordering():
    # Load the training data from the attached files
    X = np.genfromtxt('X.csv', delimiter=',')
    Y = np.genfromtxt('Y.csv', delimiter=',')
    dt = DecisionTreeClassifier(min_samples_leaf=10, random_state=99)
    dt.fit(X, Y)
    print(dt.tree_.feature)
Here are the files X and Y
Here is the output:
[ 8 9 -2 -2 9 4 -2 9 8 -2 -2 0 0 9 9 8 -2 -2 9 -2 -2 6 -2 -2 -2
2 -2 9 8 6 9 -2 -2 -2 8 9 -2 9 6 -2 -2 -2 6 -2 -2 9 -2 6 -2 -2
2 -2 -2]
value shows how the samples tested for information gain are split up at each node. So at the root node, 32561 samples are divided into two children of 24720 and 7841 samples respectively.
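As a sketch of how such counts can be verified on a fitted tree (assuming dt fitted as in the question; the 32561/24720/7841 figures come from that answerer's own data, not from X.csv):
tree = dt.tree_
root = tree.n_node_samples[0]                        # samples reaching the root
left = tree.n_node_samples[tree.children_left[0]]    # samples sent to the left child
right = tree.n_node_samples[tree.children_right[0]]  # samples sent to the right child
print(root, '=', left, '+', right)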
max_leaf_nodes – Maximum number of leaf nodes a decision tree can have. max_features – Maximum number of features that are taken into account when splitting each node.
max_depth – The maximum depth of the tree. In our case, we use a depth of two to build our decision tree. The default value is None, which will often result in over-fitted decision trees.
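A minimal sketch of these parameters in use (the parameter names are real DecisionTreeClassifier arguments; the dataset and the chosen values are only illustrative):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    max_depth=2,       # at most two levels of splits (default is None)
    max_leaf_nodes=4,  # at most four leaf nodes
    max_features=2,    # consider at most two features per split
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())  # bounded by the settings above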
By reading the Cython source code for the tree builder, we see that the -2s are just dummy values for the leaf nodes' feature split attribute.
Line 63
TREE_UNDEFINED = -2
Line 359
if is_leaf:
# Node is not expandable; set node as leaf
node.left_child = _TREE_LEAF
node.right_child = _TREE_LEAF
node.feature = _TREE_UNDEFINED
node.threshold = _TREE_UNDEFINED
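So when mapping tree_.feature back to feature names, the -2 entries should be skipped rather than used as indices. A minimal sketch (assuming the dt fitted in the question; note that sklearn.tree._tree is a private module, so importing the constant from it is just one option, and hard-coding -2 works equally well):
from sklearn.tree._tree import TREE_UNDEFINED  # == -2, as in the source above

for node_id, feat in enumerate(dt.tree_.feature):
    if feat == TREE_UNDEFINED:
        print('node %d is a leaf (no split feature)' % node_id)
    else:
        print('node %d splits on feature index %d' % (node_id, feat))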