Display more attributes in the decision tree

I am currently viewing the decision tree using the following code. Is there a way to export some calculated fields as output too?

For example, is it possible to display the sum of an input attribute at each node, i.e. the sum of feature 1 from the 'X' data array in the leaves of the tree?

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:]  
y = iris.target
#%%
from sklearn.tree import DecisionTreeClassifier
alg = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, max_leaf_nodes=10)
alg.fit(X,y)

#%%
## View tree
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(alg, out_file=None, node_ids=True, proportion=True, class_names=True, filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph

[graphviz rendering of the fitted decision tree]

asked Feb 10 '18 by Praveen Gupta Sanka


People also ask

What is attribute in decision tree?

A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).

What does a decision tree display?

A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits.

Can a decision tree have more than 2 splits?

Chi-square is another method of splitting nodes in a decision tree for datasets with categorical target values. It can produce two or more splits, and it works on the statistical significance of the differences between the parent node and its child nodes.
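As a rough illustration of the idea (with made-up counts, and using scipy.stats.chi2_contingency rather than anything from the question), the chi-square score of a candidate split can be computed from the class counts in the resulting child nodes:

import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows are the left/right child of a candidate
# split, columns are per-class sample counts in each child.
counts = np.array([[30,  5],
                   [10, 25]])

chi2, p_value, dof, expected = chi2_contingency(counts)
print("chi2 = {:.2f}, p = {:.4f}".format(chi2, p_value))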


1 Answer

There is plenty of discussion about decision trees on the scikit-learn GitHub page. There are answers on this SO question and this scikit-learn documentation page that provide the framework to get you started. With all the links out of the way, here are some functions that let a user address the question in a generalizable way. The functions are easy to modify, since I don't know whether you mean all the leaves together or each leaf individually; my approach handles each leaf individually.

The first function uses apply as a cheap way to find the indices of the leaf nodes. It's not necessary for what you're asking, but I included it as a convenience since you mentioned wanting to investigate leaf nodes, and the leaf node indices may not be known a priori.

def find_leaves(X, clf):
    """A cheap function to find leaves of a DecisionTreeClassifier
    clf must be a fitted DecisionTreeClassifier
    """
    return set(clf.apply(X))

Result on the example:

find_leaves(X, alg)
{1, 7, 8, 9, 10, 11, 12}
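
For reference, the same leaf indices can also be read straight off the fitted tree structure, since scikit-learn marks a leaf by setting its children_left entry to -1; a minimal sketch using the alg fitted above:

import numpy as np

# Leaves have no children; scikit-learn stores -1 (TREE_LEAF) for them.
leaf_ids = np.where(alg.tree_.children_left == -1)[0]
print(leaf_ids)  # e.g. [ 1  7  8  9 10 11 12], matching find_leaves(X, alg)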

The following function will return an array of values that satisfy the conditions of node and feature, where node is the index of the node from the tree that you want values for and feature is the column (or feature) that you want from X.

def node_feature_values(X, clf, node=0, feature=0, require_leaf=False):
    """this function will return an array of values 
    from the input array X. Array values will be limited to
     1. samples that passed through <node> 
     2. and from the feature <feature>.

    clf must be a fitted DecisionTreeClassifier
    """
    leaf_ids = find_leaves(X, clf)
    if (require_leaf and
        node not in leaf_ids):
        print("<require_leaf> is set, "
                "select one of these nodes:\n{}".format(leaf_ids))
        return

    # a sparse array that contains node assignment by sample
    node_indicator = clf.decision_path(X)
    node_array = node_indicator.toarray()

    # which samples at least passed through the node
    samples_in_node_mask = node_array[:,node]==1

    return X[samples_in_node_mask, feature]

Applied to the example:

values_arr = node_feature_values(X, alg, node=12, feature=0, require_leaf=True)

array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7,
       5.8, 6.4, 6.5, 7.7, 7.7, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.1, 6.4,
       7.4, 7.9, 6.4, 7.7, 6.3, 6.4, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7,
       6.3, 6.5, 6.2, 5.9])

Now the user can perform whatever mathematical operation is desired on the subset of samples for a given feature.

i.e. the sum of feature 1 from the 'X' data array in the leaves of the tree.

print("There are {} total samples in this node, "
      "{}% of the total".format(len(values_arr), len(values_arr) / float(len(X))*100))
print("Feature Sum: {}".format(values_arr.sum()))

There are 43 total samples in this node, 28.666666666666668% of the total
Feature Sum: 286.69999999999993
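
If you want the sum for every leaf at once rather than one node at a time, the two helpers above can be combined in a short loop; a small sketch, reusing find_leaves and node_feature_values:

# Sum of feature 0 for the samples that end up in each leaf.
feature = 0
leaf_sums = {n: node_feature_values(X, alg, node=n, feature=feature).sum()
             for n in find_leaves(X, alg)}
print(leaf_sums)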

Update
After re-reading the question, this is the only solution I could put together quickly that doesn't involve modifying the scikit-learn source in export.py. The code below still relies on the previously defined functions; it modifies the dot string via pydot and networkx.

# Load the data from `dot_data` variable, which you defined.
import pydot
dot_graph = pydot.graph_from_dot_data(dot_data)[0]

import networkx as nx
MG = nx.nx_pydot.from_pydot(dot_graph)

# Select a `feature` and edit the `dot` string in `networkx`.
feature = 0
for n in find_leaves(X, alg):
    nfv = node_feature_values(X, alg, node=n, feature=feature)
    # networkx >= 2.4 removed Graph.node; use Graph.nodes instead
    MG.nodes[str(n)]['label'] += "\nfeature_{} sum: {}".format(feature, nfv.sum())

# Export the `networkx` graph then plot using `graphviz.Source()`
new_dot_data = nx.nx_pydot.to_pydot(MG)
graph = graphviz.Source(new_dot_data.to_string())  # to_string() returns the dot source as str
graph

[custom decision tree graph with per-leaf feature sums added to the node labels]

Notice that all the leaves now show the sum of values from X for feature 0. I think the best way to accomplish what you're asking would be to modify tree.py and/or export.py to natively support this feature.
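
If you'd rather avoid pulling in pydot and networkx, a rough alternative is to edit the dot string directly with a regular expression, relying on the "node #<id>" text that node_ids=True puts in each label; this is only a sketch under that assumption, not something the approach above depends on:

import re

feature = 0
annotated = dot_data
for n in find_leaves(X, alg):
    nfv = node_feature_values(X, alg, node=n, feature=feature)
    # Append the sum to this node's label; the doubled backslash keeps a
    # literal "\n" in the dot source so graphviz renders a line break.
    annotated = re.sub(r'(label="node #%d\\n[^"]*)' % n,
                       r'\g<1>\\nfeature_%d sum: %.2f' % (feature, nfv.sum()),
                       annotated)

graph = graphviz.Source(annotated)
graph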

answered Sep 30 '22 by Kevin