Using RandomForestClassifier.decision_path, how do I tell which samples the classifier used to make a decision?

I'm using a RandomForestClassifier to classify samples with a binary outcome ("does not have the thing" vs "has the thing"). From the result of RandomForestClassifier.decision_path, how do I determine which samples contributed to the classification decision?

The documentation says:

Returns:

indicator : sparse csr array, shape = [n_samples, n_nodes]

Return a node indicator matrix where non zero elements indicates that the samples goes through the nodes.

n_nodes_ptr : array of size (n_estimators + 1, )

The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value for the i-th estimator.

Unfortunately, these terms are opaque to me. indicator[x:y] on a matrix of dimension [n_samples, n_nodes] seems like it is a mistake (shouldn't it be indicator[sample, n_nodes_ptr[i]:n_nodes_ptr[i+1]]?), but even then, I'm not sure what to do to take a "node indicator" and find what feature the node refers to. I can find examples using decision_path for DecisionTreeClassifier, but not for RandomForestClassifier.

asked Apr 24 '18 by zneak

1 Answer

Understanding the output of RandomForestClassifier.decision_path is easier once you realize that scikit-learn's convention is to pack as much as possible into NumPy arrays and sparse matrices.

decision_path returns the horizontal concatenation of every decision tree's decision_path, and the second return value informs you of the bounds of each sub-matrix. Using decision_path on a RandomForestClassifier is therefore equivalent to using decision_path on each of RandomForestClassifier.estimators_. For a one-row sample, you can walk the results like this:

indicators, index_by_tree = classifier.decision_path(data_row)
# index_by_tree[i]:index_by_tree[i+1] bounds the columns of the i-th tree's sub-matrix
indices = zip(index_by_tree, index_by_tree[1:])
for tree_classifier, (begin, end) in zip(classifier.estimators_, indices):
    tree = tree_classifier.tree_
    # column indices of the non-zero entries are the nodes this sample visited
    node_indices = indicators[0, begin:end].indices
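
For context, the snippet above assumes classifier is already fitted and data_row is a one-row 2D array. A minimal sketch of such a setup, with a purely illustrative toy dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
classifier = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
data_row = X[0:1]  # 0:1 keeps it 2D; decision_path expects shape (1, n_features)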

Instead of representing each node as a separate object, the tree instance exposes parallel arrays:

  • feature
  • threshold
  • value
  • children_left
  • children_right

Each is an array indexed by node number. For instance, tree.feature[3] tells you which feature node 3 tests against, and tree.threshold[3] is the value it's compared to. tree.value is a 3D array of shape [n_nodes, n_outputs, n_classes]: the first dimension is the node number, and the last holds the per-class (weighted) counts of training samples that reached that node; for a single-output classifier the middle dimension has just one element. tree.children_left[5] tells you the node number of node 5's left child, and, as you'd guess, tree.children_right[6] tells you the node number of node 6's right child.
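
As an illustration of how these arrays fit together, here's a sketch that replays one tree's decisions for the one-row data_row from the hypothetical setup above (node 0 is always the root, and leaves have children_left == children_right == -1):

tree = classifier.estimators_[0].tree_
node = 0  # start at the root
while tree.children_left[node] != tree.children_right[node]:  # internal node
    feature, threshold = tree.feature[node], tree.threshold[node]
    go_left = data_row[0, feature] <= threshold  # sklearn sends <= threshold left
    print(f"node {node}: feature {feature} <= {threshold:.3f}? {go_left}")
    node = tree.children_left[node] if go_left else tree.children_right[node]
print(f"leaf {node}: class counts {tree.value[node][0]}")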

In addition to these arrays, DecisionTreeClassifier.decision_path returns a sparse matrix of shape [n_samples, n_nodes], where entry [s, N] is non-zero if sample s visited node #N in the decision process.
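
For example, calling decision_path on a single tree directly yields the same node indices as slicing the concatenated matrix; a quick sanity check under the setup above:

import numpy as np

direct = classifier.estimators_[0].decision_path(data_row)
begin, end = index_by_tree[0], index_by_tree[1]
assert np.array_equal(np.sort(direct[0].indices),
                      np.sort(indicators[0, begin:end].indices))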

To walk back the features that were tested, you can do something like this:

features = set()
for index in node_indices:
    feature = tree.feature[index]
    if feature >= 0:  # leaf nodes have feature set to -2 (undefined)
        features.add(feature)

Note that this only tells you which features were tested, not the values they were compared against or how they affected the result.
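
Putting it all together under the hypothetical setup above, collecting every feature the forest tested for one sample looks like this:

features = set()
indicators, index_by_tree = classifier.decision_path(data_row)
bounds = zip(index_by_tree, index_by_tree[1:])
for tree_classifier, (begin, end) in zip(classifier.estimators_, bounds):
    tree = tree_classifier.tree_
    for index in indicators[0, begin:end].indices:
        if tree.feature[index] >= 0:  # skip leaves
            features.add(tree.feature[index])
print(f"features tested: {sorted(features)}")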

answered Oct 21 '22 by zneak