I'm using a RandomForestClassifier to classify samples with a binary outcome ("does not have the thing" vs "has the thing"). From the result of RandomForestClassifier.decision_path, how do I determine which samples contributed to the classification decision?
The documentation says:
Returns:
indicator : sparse csr array, shape = [n_samples, n_nodes]
Return a node indicator matrix where non zero elements indicates that the samples goes through the nodes.
n_nodes_ptr : array of size (n_estimators + 1, )
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value for the i-th estimator.
Unfortunately, these terms are opaque to me. indicator[x:y] on a matrix of dimension [n_samples, n_nodes] seems like it is a mistake (shouldn't it be indicator[sample, n_nodes_ptr[i]:n_nodes_ptr[i+1]]?), but even then, I'm not sure what to do to take a "node indicator" and find what feature the node refers to. I can find examples using decision_path for DecisionTreeClassifier, but not for RandomForestClassifier.
Understanding the output of RandomForestClassifier.decision_path is easier when you realize that the sklearn conventions are to put as much stuff as possible inside numpy matrices. decision_path returns the horizontal concatenation of every decision tree's decision_path, and the second return value informs you of the bounds of each sub-matrix. Using decision_path on a RandomForestClassifier is therefore equivalent to using decision_path on each of RandomForestClassifier.estimators_. For a one-row sample, you can walk the results like this:
# data_row must be a 2-D, single-row array (e.g. X[0:1])
indicators, index_by_tree = classifier.decision_path(data_row)
# pair consecutive offsets: columns [begin, end) belong to one tree
indices = zip(index_by_tree, index_by_tree[1:])
for tree_classifier, (begin, end) in zip(classifier.estimators_, indices):
    tree = tree_classifier.tree_
    # column indices of the non-zero entries, i.e. the nodes this sample goes through
    node_indices = indicators[0, begin:end].indices
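To convince yourself that this really is just a concatenation, you can compare each slice against the corresponding tree's own decision_path. This is only a sketch; it assumes the classifier, data_row, indicators and index_by_tree names from above, with classifier already fitted:
import numpy as np

for tree_classifier, (begin, end) in zip(classifier.estimators_,
                                         zip(index_by_tree, index_by_tree[1:])):
    per_tree = tree_classifier.decision_path(data_row)  # shape (1, tree.node_count)
    forest_slice = indicators[:, begin:end]             # the same columns, taken from the forest matrix
    assert np.array_equal(per_tree.toarray(), forest_slice.toarray())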
Instead of treating each node as a separate object, the tree instance has the following properties:
feature
value
children_left
children_right
Each is an array or matrix documenting the features of tree nodes identified by their index. For instance, tree.feature[3] tells you which feature node 3 tests against, and tree.threshold[3] tells you the threshold that feature is compared to. tree.value is a 3D array: the first dimension is the node number, the second is the number of outputs (one for a single-output classifier, which is why it only has one element in my case), and the last dimension holds the per-class sample counts for that node. tree.children_left[5] tells you the node number of node 5's left child, and, as you guessed, tree.children_right[6] tells you the node number of node 6's right child.
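For example, you can print what each node along the path does. This is a sketch reusing the tree and node_indices variables from the loop above (tree.threshold is the split threshold array mentioned just before):
for index in node_indices:
    if tree.feature[index] >= 0:   # internal node: it tests a feature
        print("node %d: feature %d <= %.3f ? -> left %d, else right %d" % (
            index, tree.feature[index], tree.threshold[index],
            tree.children_left[index], tree.children_right[index]))
    else:                          # leaf node: per-class sample counts
        print("node %d: leaf, class counts %s" % (index, tree.value[index][0]))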
In addition to these arrays, DecisionTreeClassifier.decision_path returns a sparse matrix of shape [n_samples, n_nodes], where entry [i, N] is non-zero if node #N was visited when classifying sample i.
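You can also reproduce a single tree's path by hand, which makes the relationship between the node arrays and decision_path concrete. A sketch, assuming tree_classifier is one of classifier.estimators_ and data_row is a single-row 2-D numpy array:
tree = tree_classifier.tree_
node = 0                       # the root is always node 0
visited = [node]
while tree.children_left[node] != tree.children_right[node]:  # equal (-1) only at leaves
    if data_row[0, tree.feature[node]] <= tree.threshold[node]:
        node = tree.children_left[node]
    else:
        node = tree.children_right[node]
    visited.append(node)
# `visited` now matches the non-zero columns of tree_classifier.decision_path(data_row)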
To walk back the features that were tested, you can do something like this:
features = set()
for index in node_indices:
    feature = tree.feature[index]
    if feature >= 0:  # leaf nodes have feature == -2, so skip them
        features.add(feature)
Note that this tells you about the features which were tested, nothing about their value or how they impacted the result.
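Putting it all together, here is a self-contained sketch; the make_classification dataset and the n_estimators value are placeholder choices for illustration only:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # binary labels by default
classifier = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

data_row = X[0:1]                                    # decision_path expects a 2-D array
indicators, index_by_tree = classifier.decision_path(data_row)

features = set()
for tree_classifier, (begin, end) in zip(classifier.estimators_,
                                         zip(index_by_tree, index_by_tree[1:])):
    tree = tree_classifier.tree_
    for index in indicators[0, begin:end].indices:   # nodes visited in this tree
        if tree.feature[index] >= 0:                 # skip leaves
            features.add(int(tree.feature[index]))

print("features tested for this sample:", sorted(features))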