I'm using a RandomForestClassifier to classify samples with a binary outcome ("does not have the thing" vs "has the thing"). From the result of RandomForestClassifier.decision_path, how do I determine which samples contributed to the classification decision?
The documentation says:
Returns:
indicator : sparse csr array, shape = [n_samples, n_nodes]
Return a node indicator matrix where non zero elements indicates that the samples goes through the nodes.
n_nodes_ptr : array of size (n_estimators + 1, )
The columns from indicator[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value for the i-th estimator.
Unfortunately, these terms are opaque to me. indicator[x:y] on a matrix of dimension [n_samples, n_nodes] seems like it is a mistake (shouldn't it be indicator[sample, n_nodes_ptr[i]:n_nodes_ptr[i+1]]?), but even then, I'm not sure what to do to take a "node indicator" and find what feature the node refers to. I can find examples using decision_path for DecisionTreeClassifier, but not for RandomForestClassifier.
Understanding the output of RandomForestClassifier.decision_path is easier when you realize that the sklearn conventions are to put as much stuff as possible inside numpy matrices. decision_path returns the horizontal concatenation of every decision tree's decision_path, and the second return value informs you of the bounds of each sub-matrix. Using decision_path on a RandomForestClassifier is therefore equivalent to using decision_path on each of RandomForestClassifier.estimators_. For a one-row sample, you can walk the results like this:
# data_row must be a 2-D, single-row array (e.g. X[0:1])
indicators, index_by_tree = classifier.decision_path(data_row)
# pair consecutive offsets: columns [begin, end) belong to one tree
indices = zip(index_by_tree, index_by_tree[1:])
for tree_classifier, (begin, end) in zip(classifier.estimators_, indices):
    tree = tree_classifier.tree_
    # column indices of the non-zero entries, i.e. the nodes this sample goes through
    node_indices = indicators[0, begin:end].indices
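To convince yourself that this really is just a concatenation, you can compare each slice against the corresponding tree's own decision_path. This is only a sketch; it assumes the classifier, data_row, indicators and index_by_tree names from above, with classifier already fitted:
import numpy as np

for tree_classifier, (begin, end) in zip(classifier.estimators_,
                                         zip(index_by_tree, index_by_tree[1:])):
    per_tree = tree_classifier.decision_path(data_row)  # shape (1, tree.node_count)
    forest_slice = indicators[:, begin:end]             # the same columns, taken from the forest matrix
    assert np.array_equal(per_tree.toarray(), forest_slice.toarray())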
Instead of treating each node as a separate object, the tree instance has the following properties:
feature
value
children_left
children_right
Each is an array or matrix documenting the features of tree nodes identified by their index. For instance, tree.feature[3] tells you which feature node 3 tests against, and tree.threshold[3] tells you the threshold that feature is compared to. tree.value is a 3D array: the first dimension is the node number, the second is the number of outputs (one for a single-output classifier, which is why it only has one element in my case), and the last dimension holds the per-class sample counts for that node. tree.children_left[5] tells you the node number of node 5's left child, and, as you guessed, tree.children_right[6] tells you the node number of node 6's right child.
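For example, you can print what each node along the path does. This is a sketch reusing the tree and node_indices variables from the loop above (tree.threshold is the split threshold array mentioned just before):
for index in node_indices:
    if tree.feature[index] >= 0:   # internal node: it tests a feature
        print("node %d: feature %d <= %.3f ? -> left %d, else right %d" % (
            index, tree.feature[index], tree.threshold[index],
            tree.children_left[index], tree.children_right[index]))
    else:                          # leaf node: per-class sample counts
        print("node %d: leaf, class counts %s" % (index, tree.value[index][0]))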
In addition to these arrays, DecisionTreeClassifier.decision_path returns a sparse matrix of shape [n_samples, n_nodes], where entry [i, N] is non-zero if node #N was visited when classifying sample i.
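You can also reproduce a single tree's path by hand, which makes the relationship between the node arrays and decision_path concrete. A sketch, assuming tree_classifier is one of classifier.estimators_ and data_row is a single-row 2-D numpy array:
tree = tree_classifier.tree_
node = 0                       # the root is always node 0
visited = [node]
while tree.children_left[node] != tree.children_right[node]:  # equal (-1) only at leaves
    if data_row[0, tree.feature[node]] <= tree.threshold[node]:
        node = tree.children_left[node]
    else:
        node = tree.children_right[node]
    visited.append(node)
# `visited` now matches the non-zero columns of tree_classifier.decision_path(data_row)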
To walk back the features that were tested, you can do something like this:
features = set()
for index in node_indices:
    feature = tree.feature[index]
    if feature >= 0:  # leaf nodes have feature == -2, so skip them
        features.add(feature)
Note that this tells you about the features which were tested, nothing about their value or how they impacted the result.
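Putting it all together, here is a self-contained sketch; the make_classification dataset and the n_estimators value are placeholder choices for illustration only:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # binary labels by default
classifier = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

data_row = X[0:1]                                    # decision_path expects a 2-D array
indicators, index_by_tree = classifier.decision_path(data_row)

features = set()
for tree_classifier, (begin, end) in zip(classifier.estimators_,
                                         zip(index_by_tree, index_by_tree[1:])):
    tree = tree_classifier.tree_
    for index in indicators[0, begin:end].indices:   # nodes visited in this tree
        if tree.feature[index] >= 0:                 # skip leaves
            features.add(int(tree.feature[index]))

print("features tested for this sample:", sorted(features))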