 

Current node to next node feature combinations in decision tree learning: useful to determine potential interactions?

Using some guidance from this scikit-learn tutorial on understanding decision tree structures, I had the idea that perhaps looking at combinations of features occurring between two connected nodes might give some insight as to potential "interaction" terms. That is, by looking at how often a given feature y follows a given feature x, we might be able to determine whether there's some higher order interaction between x and y, versus other variables in the model.

Here is my setup. Basically this object just parses the structure of the tree and makes it easy for us to traverse nodes and determine what's happening at each node.

import numpy as np

class TreeInteractionFinder(object):

    def __init__(
        self,
        model,
        feature_names = None):

        self.model = model
        self.feature_names = feature_names

        self._parse_tree_structure()
        self._node_and_leaf_compute()

    def _parse_tree_structure(self):
        self.n_nodes = self.model.tree_.node_count
        self.children_left = self.model.tree_.children_left
        self.children_right = self.model.tree_.children_right
        self.feature = self.model.tree_.feature
        self.threshold = self.model.tree_.threshold
        self.n_node_samples = self.model.tree_.n_node_samples
        self.predicted_values = self.model.tree_.value

    def _node_and_leaf_compute(self):
        ''' Compute node depth and whether each node is a leaf '''
        node_depth = np.zeros(shape=self.n_nodes, dtype=np.int64)
        is_leaves = np.zeros(shape=self.n_nodes, dtype=bool)
        # Seed is the root node id and its parent depth
        stack = [(0, -1)]
        while stack:
            node_idx, parent_depth = stack.pop()
            node_depth[node_idx] = parent_depth + 1

            # If we have a test (where "test" means decision-test) node
            if self.children_left[node_idx] != self.children_right[node_idx]:
                stack.append((self.children_left[node_idx], parent_depth + 1))
                stack.append((self.children_right[node_idx], parent_depth + 1))
            else:
                is_leaves[node_idx] = True

        self.is_leaves = is_leaves
        self.node_depth = node_depth

Next, I'll train a somewhat-deep tree on some dataset. The Boston housing dataset gave me some interesting results, and thus I've used it in my example:

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

bunch = load_boston()

X, y = bunch.data, bunch.target
feature_names = bunch.feature_names

model = DecisionTreeRegressor(
    max_depth=20,
    min_samples_leaf=2
)

model.fit(X, y)

finder = TreeInteractionFinder(model, feature_names)

from collections import defaultdict
feature_combos = defaultdict(int)

# Traverse the tree fully, counting the occurrences of features at the current and next indices
for idx in range(finder.n_nodes):
    curr_node_is_leaf = finder.is_leaves[idx]
    curr_feature = finder.feature_names[finder.feature[idx]]
    if not curr_node_is_leaf:
        # Test to see if we're at the end of the tree
        try:
            next_idx = finder.feature[idx + 1]
        except IndexError:
            break
        else:
            next_node_is_leaf = finder.is_leaves[next_idx]
            if not next_node_is_leaf:
                next_feature = finder.feature_names[next_idx]
                feature_combos[frozenset({curr_feature, next_feature})] += 1

from pprint import pprint
pprint(sorted(feature_combos.items(), key=lambda x: -x[1]))
pprint(sorted(zip(feature_names, model.feature_importances_), key=lambda x: -x[1]))

Which yields:

$ python3 *py
[(frozenset({'AGE', 'LSTAT'}), 4),
 (frozenset({'RM', 'LSTAT'}), 3),
 (frozenset({'AGE', 'NOX'}), 3),
 (frozenset({'NOX', 'CRIM'}), 3),
 (frozenset({'NOX', 'DIS'}), 3),
 (frozenset({'LSTAT', 'DIS'}), 2),
 (frozenset({'AGE', 'RM'}), 2),
 (frozenset({'AGE', 'DIS'}), 2),
 (frozenset({'TAX', 'DIS'}), 1),
 (frozenset({'RM', 'INDUS'}), 1),
 (frozenset({'PTRATIO'}), 1),
 (frozenset({'NOX', 'PTRATIO'}), 1),
 (frozenset({'LSTAT', 'CRIM'}), 1),
 (frozenset({'RM'}), 1),
 (frozenset({'TAX', 'PTRATIO'}), 1),
 (frozenset({'NOX'}), 1),
 (frozenset({'DIS', 'CRIM'}), 1),
 (frozenset({'AGE', 'PTRATIO'}), 1),
 (frozenset({'AGE', 'CRIM'}), 1),
 (frozenset({'ZN', 'DIS'}), 1),
 (frozenset({'ZN', 'CRIM'}), 1),
 (frozenset({'CRIM', 'PTRATIO'}), 1),
 (frozenset({'RM', 'CRIM'}), 1)]
[('RM', 0.60067090411997),
 ('LSTAT', 0.22148824141475706),
 ('DIS', 0.068263421165279),
 ('CRIM', 0.03893906506019243),
 ('NOX', 0.028695328014265362),
 ('PTRATIO', 0.014211478583574726),
 ('AGE', 0.012467751974477529),
 ('TAX', 0.011821058983765207),
 ('B', 0.002420619208623876),
 ('INDUS', 0.0008323703650693053),
 ('ZN', 0.00018976111002551332),
 ('CHAS', 0.0),
 ('RAD', 0.0)]

After adding the criterion to exclude "next" nodes that are leaves, the results seem to improve.

Now, the combination of features occurring most frequently is frozenset({'AGE', 'LSTAT'}), that is, the age of the building together with the "% lower status of the population" (whatever that means, presumably a measure of low-income rate). From model.feature_importances_, both LSTAT and AGE are relatively important predictors of sale price, which leads me to believe an explicit AGE * LSTAT interaction might be useful.
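
One rough way I could sanity-check this (just a sketch, not something I've run; it reuses the X, y, and feature_names from above, and random_state is my addition for repeatability) would be to append the explicit AGE * LSTAT product as an extra column and compare cross-validated scores with and without it:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Append the explicit AGE * LSTAT product as an extra feature column
age_idx = list(feature_names).index('AGE')
lstat_idx = list(feature_names).index('LSTAT')
X_aug = np.column_stack([X, X[:, age_idx] * X[:, lstat_idx]])

tree = DecisionTreeRegressor(max_depth=20, min_samples_leaf=2, random_state=0)
print('CV score without the product:', cross_val_score(tree, X, y, cv=5).mean())
print('CV score with AGE * LSTAT   :', cross_val_score(tree, X_aug, y, cv=5).mean())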

Is this even barking up the right tree (pun maybe intended)? Does counting combinations of sequential features in a given tree speak to potential interactions in a model?

asked Nov 05 '19 by blacksite



1 Answer

TL;DR: a decision tree is not the best tool for analyzing the importance of feature combinations.

Like any other algorithm, the decision tree (DT) has its weaknesses. The assumption of the DT algorithm in its basic form is that the features it works with are uncorrelated. Growing a DT is then the process of choosing, from the set of all possible questions (decisions), the one that splits the set of examples in a way that produces maximum gain (according to the chosen loss function, usually the Gini index or information gain). If your features are correlated, you need either to try to decorrelate them (by applying PCA, for example) or to throw some away in a smart way (a procedure called feature selection); otherwise it may cause either poor generalization or too many small leaves. You can read more about it here.
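
For example, a minimal sketch of the PCA route (illustrative only, reusing the X and y from your snippet) could look like this:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Rotate the features onto decorrelated (principal) axes before growing the tree
decorrelated_tree = make_pipeline(
    StandardScaler(),
    PCA(),  # keeping all components is a pure rotation; no information is dropped
    DecisionTreeRegressor(max_depth=20, min_samples_leaf=2)
)
decorrelated_tree.fit(X, y)

The trade-off is that the splits are now made on principal components rather than on the original named features, so interpretability suffers.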

Another issue with DTs is that they were designed to work with categorical data, and we make them work with numerical data by binning it. So on certain features the sheer number of possible questions can be much higher than on other features.
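
You can see this on the Boston data itself (a quick sketch reusing X and feature_names from your snippet): the number of distinct values per feature is a rough proxy for how many candidate thresholds the tree can ask about, and it is far larger for continuous features than for a dummy like CHAS.

import numpy as np

# Distinct values per feature ~ how many split questions the tree can pose on it
for name, column in zip(feature_names, X.T):
    print(name, len(np.unique(column)))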

That said, once your DT is ready, you can gauge the importance of each decision (data falling into a certain range of values): the closer the decision is to the root of the tree, the more important it is. So location matters too, and the number of times a certain combination of features appears in the tree does not directly indicate the importance of that combination. And while some meaningful combinations might surface, their counts will not necessarily be high enough to stand out from the crowd.
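
If you still want to use the counting idea, one rough adjustment (just a sketch, building on the finder object from your question and following the true parent-to-child links) is to weight each parent/child feature pair by how close the parent is to the root:

from collections import defaultdict

weighted_combos = defaultdict(float)
for parent in range(finder.n_nodes):
    if finder.is_leaves[parent]:
        continue
    parent_feat = finder.feature_names[finder.feature[parent]]
    for child in (finder.children_left[parent], finder.children_right[parent]):
        if finder.is_leaves[child]:
            continue
        child_feat = finder.feature_names[finder.feature[child]]
        pair = frozenset({parent_feat, child_feat})
        # Shallower pairs (closer to the root) get larger weights
        weighted_combos[pair] += 1.0 / (1 + finder.node_depth[parent])

print(sorted(weighted_combos.items(), key=lambda kv: -kv[1])[:10])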

answered Sep 25 '22 by igrinis