 

is there any way to get samples under each leaf of a decision tree?

I have trained a decision tree using a dataset. Now I want to see which samples fall under which leaf of the tree.

From here I want the red circled samples.

[Image: a decision tree diagram with the samples under each leaf circled in red]

I am using scikit-learn's decision tree implementation in Python.

Farshid Rayhan asked Jul 30 '17 10:07



1 Answer

If you want only the leaf for each sample you can just use

clf.apply(iris.data)

array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 14, 5, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 16, 16, 16, 16, 16, 16, 6, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 8, 16, 16, 16, 16, 16, 16, 15, 16, 16, 11, 16, 16, 16, 8, 8, 16, 16, 16, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16])
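If only the leaf-to-samples grouping is needed, the `apply` output can be turned into a mapping directly (a minimal self-contained sketch; `leaf_samples` is an illustrative name, not part of the scikit-learn API):

```python
import collections

import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
clf.fit(iris.data, iris.target)

# apply() returns, for every sample, the index of the leaf it ends up in
leaf_ids = clf.apply(iris.data)

# group sample indices by their leaf index
leaf_samples = collections.defaultdict(list)
for sample_idx, leaf_id in enumerate(leaf_ids):
    leaf_samples[leaf_id].append(sample_idx)

# every sample lands in exactly one leaf
assert sum(len(v) for v in leaf_samples.values()) == len(iris.data)
```

This gives only the leaves, not the internal nodes, which is exactly what the question asks for.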

If you want to get all samples for each node you could calculate all the decision paths with

dec_paths = clf.decision_path(iris.data)

Then loop over the decision paths, convert each one to a dense array with toarray(), and check which nodes the sample passes through. The results are stored in a defaultdict whose keys are node indices and whose values are lists of sample indices.

for d, dec in enumerate(dec_paths):
    path = dec.toarray()[0]  # indicator row: 1 where sample d visits node i
    for i in range(clf.tree_.node_count):
        if path[i] == 1:
            samples[i].append(d)

Complete code

import sklearn.datasets
import sklearn.tree
import collections

clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
iris = sklearn.datasets.load_iris()
clf = clf.fit(iris.data, iris.target)

# node index -> list of sample indices that pass through that node
samples = collections.defaultdict(list)
dec_paths = clf.decision_path(iris.data)

for d, dec in enumerate(dec_paths):
    path = dec.toarray()[0]  # indicator row: 1 where sample d visits node i
    for i in range(clf.tree_.node_count):
        if path[i] == 1:
            samples[i].append(d)

Output

print(samples[13])

[70, 126, 138]
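Note that `samples` contains every node on each path, internal nodes included. The same mapping can also be built without densifying each row, by reading the sparse matrix's COO coordinates directly, and then restricted to leaves via `clf.tree_.children_left` (a self-contained sketch; `leaf_samples` is an illustrative name):

```python
import collections

import numpy as np
import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# build the same node -> samples mapping, but from the sparse matrix's
# COO coordinates instead of converting each row to a dense array
dec_paths = clf.decision_path(iris.data).tocoo()
samples = collections.defaultdict(list)
for sample_idx, node_idx in zip(dec_paths.row, dec_paths.col):
    samples[node_idx].append(sample_idx)

# scikit-learn marks leaves with children_left == -1; keep only those entries
leaf_ids = np.where(clf.tree_.children_left == -1)[0]
leaf_samples = {i: samples[i] for i in leaf_ids}
```

This avoids the nested loop over all nodes per sample, which matters for large trees or datasets.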

Maximilian Peters answered Oct 24 '22 01:10