Display more attributes in the decision tree

I am currently viewing the decision tree using the following code. Is there a way to export some calculated fields as output too?

For example, is it possible to display the sum of an input attribute at each node, i.e. the sum of feature 1 from the 'X' data array in the leaves of the tree?

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
#%%
from sklearn.tree import DecisionTreeClassifier
alg = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, max_leaf_nodes=10)
alg.fit(X, y)

#%%
## View tree
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(alg, out_file=None, node_ids=True, proportion=True,
                                class_names=True, filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph

asked Feb 10 '18 01:02

Praveen Gupta Sanka

1 Answer

There is plenty of discussion about decision trees in scikit-learn on the GitHub page. Answers on this SO question and this scikit-learn documentation page provide a framework to get you started. With the links out of the way, here are some functions that address the question in a generalizable manner. They are easy to modify, since I don't know whether you mean all the leaves together or each leaf individually. My approach is the latter.

The first function uses apply as a cheap way to find the indices of the leaf nodes. It's not necessary to achieve what you're asking, but I included it as a convenience since you mentioned you want to investigate leaf nodes and leaf node indices may be unknown a priori.

def find_leaves(X, clf):
    """A cheap function to find leaves of a DecisionTreeClassifier
    clf must be a fitted DecisionTreeClassifier
    """
    return set(clf.apply(X))

Result on the example:

find_leaves(X, alg)
{1, 7, 8, 9, 10, 11, 12}
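Note that apply only reports leaves that at least one row of X lands in. If you need every leaf regardless of the data, the tree structure itself can be inspected. The following is a minimal sketch, not part of the original answer: leaf_node_ids is a hypothetical helper, and it assumes scikit-learn's convention that a fitted tree_ object marks a missing child with -1 in its children_left and children_right arrays. It runs here on hand-written arrays so that it works without a fitted model.

```python
def leaf_node_ids(children_left, children_right):
    """Return the ids of all nodes that have no children.

    Hypothetical helper. Assumes the encoding used by scikit-learn's
    fitted ``tree_`` object, where a missing child is stored as -1.
    """
    no_child = -1
    return {
        node_id
        for node_id, (left, right) in enumerate(zip(children_left, children_right))
        if left == no_child and right == no_child
    }


# Hand-written arrays for a tiny tree: node 0 splits into nodes 1 and 2,
# node 2 splits into nodes 3 and 4, so nodes 1, 3 and 4 are leaves.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
print(leaf_node_ids(children_left, children_right))
```

With the classifier from the question, the call would be leaf_node_ids(alg.tree_.children_left, alg.tree_.children_right). On the training data this should give the same set as find_leaves, because every leaf of a fitted tree contains at least one training sample.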

The following function will return an array of values that satisfy the conditions of node and feature, where node is the index of the node from the tree that you want values for and feature is the column (or feature) that you want from X.

def node_feature_values(X, clf, node=0, feature=0, require_leaf=False):
    """this function will return an array of values 
    from the input array X. Array values will be limited to
     1. samples that passed through <node> 
     2. and from the feature <feature>.

    clf must be a fitted DecisionTreeClassifier
    """
    leaf_ids = find_leaves(X, clf)
    if (require_leaf and
        node not in leaf_ids):
        print("<require_leaf> is set, "
                "select one of these nodes:\n{}".format(leaf_ids))
        return

    # a sparse array that contains node assignment by sample
    node_indicator = clf.decision_path(X)
    node_array = node_indicator.toarray()

    # which samples at least passed through the node
    samples_in_node_mask = node_array[:,node]==1

    return X[samples_in_node_mask, feature]

Applied to the example:

values_arr = node_feature_values(X, alg, node=12, feature=0, require_leaf=True)

array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7,
       5.8, 6.4, 6.5, 7.7, 7.7, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.1, 6.4,
       7.4, 7.9, 6.4, 7.7, 6.3, 6.4, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7,
       6.3, 6.5, 6.2, 5.9])

Now the user can perform whatever mathematical operation is desired on the subset of samples for a given feature.

For the case raised in the question, the sum of a feature from the 'X' data array in a leaf of the tree:

print("There are {} total samples in this node, "
      "{}% of the total".format(len(values_arr), len(values_arr) / float(len(X))*100))
print("Feature Sum: {}".format(values_arr.sum()))

There are 43 total samples in this node, 28.666666666666668% of the total
Feature Sum: 286.69999999999993
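If you want the sum for every leaf rather than one node at a time, the leaf id that apply returns for each sample can be used as a grouping key. This is a rough sketch under that assumption: sum_feature_by_leaf is a hypothetical helper and the numbers below are toy values, not results from the iris tree.

```python
from collections import defaultdict


def sum_feature_by_leaf(leaf_ids, feature_values):
    """Sum one feature column per leaf in a single pass.

    Hypothetical helper: leaf_ids[i] is the leaf that sample i ends up in
    and feature_values[i] is that sample's value for the chosen feature.
    """
    sums = defaultdict(float)
    for leaf, value in zip(leaf_ids, feature_values):
        # int()/float() also turn NumPy scalars into plain Python numbers
        sums[int(leaf)] += float(value)
    return dict(sums)


# Toy input: five samples that fall into leaves 1, 3 and 4
leaf_ids = [1, 4, 4, 1, 3]
feature_values = [5.0, 6.5, 7.0, 4.5, 6.0]
print(sum_feature_by_leaf(leaf_ids, feature_values))
```

With the objects from the question, the call would be sum_feature_by_leaf(alg.apply(X), X[:, 0]). This only covers leaves; for internal nodes the decision_path approach above is still needed.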

Update
After re-reading the question, this is the only solution I can put together quickly that doesn't involve modifying the scikit-learn source code in export.py. The code below still relies on the previously defined functions. It modifies the DOT string via pydot and networkx.

# Load the data from `dot_data` variable, which you defined.
import pydot
dot_graph = pydot.graph_from_dot_data(dot_data)[0]

import networkx as nx
MG = nx.nx_pydot.from_pydot(dot_graph)

# Select a `feature` and edit the `dot` string in `networkx`.
feature = 0
for n in find_leaves(X, alg):
    nfv = node_feature_values(X, alg, node=n, feature=feature)
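    # `MG.nodes` is used because newer networkx releases removed the `MG.node` attribute
    node_attrs = MG.nodes[str(n)]
    node_attrs['label'] = node_attrs['label'] + "\nfeature_{} sum: {}".format(feature, nfv.sum())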

# Export the `networkx` graph then plot using `graphviz.Source()`
new_dot_data = nx.nx_pydot.to_pydot(MG)
# `to_string()` returns the DOT source as text, which is what `graphviz.Source` expects
graph = graphviz.Source(new_dot_data.to_string())
graph

[Image: custom decision tree graph]

Notice all the leaves have the sum of values from X for feature 0. I think the best way to accomplish what you're asking would be to modify tree.py and/or export.py to natively support this feature.
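Short of patching scikit-learn, another option that avoids pydot and networkx is to edit the DOT text directly. The sketch below is an untested-against-every-version idea, not the library's API: append_to_dot_labels is a hypothetical helper, and it assumes that export_graphviz writes each node as a line of the form `<id> [label="...", ...]` with no double quote inside the label. It is demonstrated on a hand-written DOT string.

```python
import re


def append_to_dot_labels(dot_source, extra_by_node):
    """Append one line of text to the label of selected nodes.

    Hypothetical helper. extra_by_node maps an integer node id to the text
    to add. Assumes node statements look like: <id> [label="...", ...]
    and that labels contain no double quotes.
    """
    pattern = re.compile(r'^(\d+) \[label="([^"]*)"', flags=re.MULTILINE)

    def add_line(match):
        node_id = int(match.group(1))
        if node_id not in extra_by_node:
            return match.group(0)  # leave other nodes untouched
        # A literal backslash followed by "n" is a line break in a DOT label
        return '{} [label="{}\\n{}"'.format(
            match.group(1), match.group(2), extra_by_node[node_id])

    return pattern.sub(add_line, dot_source)


# Hand-written DOT source that mimics the assumed layout
dot_source = "\n".join([
    "digraph Tree {",
    '0 [label="node #0\\nsamples = 100.0%"] ;',
    '1 [label="node #1\\nsamples = 40.0%"] ;',
    "0 -> 1 ;",
    "}",
])
new_source = append_to_dot_labels(dot_source, {1: "feature_0 sum: 9.5"})
print(new_source)
```

With the objects from the question, you would build the mapping from the leaf sums (for example with node_feature_values for each leaf) and pass the result to graphviz.Source(append_to_dot_labels(dot_data, mapping)). Edge lines such as `0 -> 1` do not match the pattern, so only node labels change.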

answered Sep 30 '22 07:09

Kevin

Tags: python, scikit-learn, decision-tree, pygraphviz
