I have trained a decision tree on a dataset. Now I want to see which samples fall under which leaf of the tree. In the tree shown below, I want the samples circled in red.

Image: https://i.stack.imgur.com/DYhwf.png

I am using the decision tree implementation in Python's scikit-learn.


Is there any way to get the samples under each leaf of a decision tree?

1 Answer

If you only want the leaf that each sample ends up in, you can use apply, which returns the node id of that leaf for every sample:

clf.apply(iris.data)

array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 14, 5, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 16, 16, 16, 16, 16, 16, 6, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 8, 16, 16, 16, 16, 16, 16, 15, 16, 16, 11, 16, 16, 16, 8, 8, 16, 16, 16, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16])
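
Since the question asks for the samples under each leaf, the array returned by apply can be grouped by leaf id directly. The following is a minimal sketch, assuming the same iris classifier that is built in the complete code further down; the name leaf_samples is introduced here for illustration only.

```python
import collections

import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# apply() returns, for every sample, the node id of the leaf it lands in.
leaf_ids = clf.apply(iris.data)

# Group sample indices by leaf id (leaf_samples is a hypothetical name).
leaf_samples = collections.defaultdict(list)
for sample_idx, leaf_id in enumerate(leaf_ids):
    leaf_samples[int(leaf_id)].append(sample_idx)

# Print each leaf id together with the number of samples it contains.
for leaf_id in sorted(leaf_samples):
    print(leaf_id, len(leaf_samples[leaf_id]))
```

Every sample appears under exactly one key, so this is usually enough when only the leaves matter and the internal nodes are not needed.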

If you want all the samples that pass through each node (internal nodes as well as leaves), you can compute the decision paths with decision_path:

dec_paths = clf.decision_path(iris.data)

Then loop over the decision paths, convert each one to a dense array with toarray(), and check for every node whether the sample passes through it. The results are stored in a defaultdict whose keys are node ids and whose values are lists of sample indices.

for d, dec in enumerate(dec_paths):
    row = dec.toarray()[0]  # densify once per sample, not once per node
    for i in range(clf.tree_.node_count):
        if row[i] == 1:
            samples[i].append(d)

Complete code

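import collections

import sklearn.datasets
import sklearn.tree

# Fit a decision tree on the iris dataset.
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
iris = sklearn.datasets.load_iris()
clf = clf.fit(iris.data, iris.target)

# Map each node id to the list of sample indices that pass through it.
samples = collections.defaultdict(list)
# Sparse indicator matrix: one row per sample, one column per node.
dec_paths = clf.decision_path(iris.data)

for d, dec in enumerate(dec_paths):
    row = dec.toarray()[0]  # densify once per sample, not once per node
    for i in range(clf.tree_.node_count):
        if row[i] == 1:
            samples[i].append(d)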

Output

print(samples[13])

[70, 126, 138]
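
The loop above converts every row to a dense array. A possible alternative, sketched here under the assumption that the matrix returned by decision_path stores only the visited nodes as nonzero entries, is to read the sample indices straight out of the columns of the sparse matrix. The names paths_csc, node_samples and leaf_samples are hypothetical and not part of the original answer.

```python
import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# decision_path() returns a sparse (n_samples, n_nodes) indicator matrix.
# In CSC form, column i lists the row (sample) indices that visit node i.
paths_csc = clf.decision_path(iris.data).tocsc()

node_samples = {
    i: sorted(paths_csc.indices[paths_csc.indptr[i]:paths_csc.indptr[i + 1]].tolist())
    for i in range(clf.tree_.node_count)
}

# Keep only the leaves: tree_.children_left is -1 for leaf nodes.
leaf_samples = {
    i: s for i, s in node_samples.items() if clf.tree_.children_left[i] == -1
}

print(sorted(leaf_samples))  # node ids of all leaves
```

Here node_samples plays the same role as the samples dictionary above, and leaf_samples restricts it to the leaves, which is what the question asks for.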

127

answered Oct 24 '22 01:10

Maximilian Peters


Tags: python, machine-learning, scikit-learn, decision-tree

Farshid Rayhan
