
Feature importances - Bagging, scikit-learn

For a project I am comparing a number of decision tree ensembles, using the regression algorithms of scikit-learn (Random Forest, Extra Trees, AdaBoost and Bagging). To compare and interpret them I use the feature importances, though for the bagging regressor these do not appear to be available.

My question: Does anybody know how to get the feature importances list for Bagging?

Greetings, Kornee

asked Jun 02 '17 by Kornee


2 Answers

Are you talking about BaggingClassifier? It can be used with many base estimators, so feature importances are not implemented for it. There are model-independent methods for computing feature importances (see e.g. https://github.com/scikit-learn/scikit-learn/issues/8898), but scikit-learn doesn't use them.
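As an aside not in the original answer: newer scikit-learn versions (0.22+) do ship one such model-independent method, permutation importance, in sklearn.inspection. A minimal sketch, with a default BaggingClassifier chosen here purely for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier().fit(X, y)

# shuffle each feature in turn and measure the drop in score
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # one importance value per feature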

In case of decision trees as base estimators you can compute the feature importances yourself: it is just the average of tree.feature_importances_ over all trees in bagging.estimators_:

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier())
clf.fit(X, y)

# average the per-tree importances across all trees in the ensemble
feature_importances = np.mean([
    tree.feature_importances_ for tree in clf.estimators_
], axis=0)

RandomForestClassifier does the same computation internally.
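For comparison, a minimal sketch (reusing X and y from above) showing that RandomForestClassifier exposes the averaged importances directly:

from sklearn.ensemble import RandomForestClassifier

# the attribute is computed internally by averaging over rf.estimators_
rf = RandomForestClassifier().fit(X, y)
print(rf.feature_importances_)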

answered Sep 29 '22 by Mikhail Korobov


Extending what CharlesG posted, here is my solution for subclassing BaggingRegressor and overriding fit (the same should work for BaggingClassifier).

import numpy as np
from sklearn.ensemble import BaggingRegressor


class myBaggingRegressor(BaggingRegressor):
    def fit(self, X, y):
        fitd = super().fit(X, y)
        if self.max_features == 1.0:
            # every estimator saw all features, so a plain average works
            if hasattr(fitd.estimators_[0], 'feature_importances_'):
                self.feature_importances_ = np.mean(
                    [est.feature_importances_ for est in fitd.estimators_], axis=0)
            else:
                self.coef_ = np.mean(
                    [est.coef_ for est in fitd.estimators_], axis=0)
                self.intercept_ = np.mean(
                    [est.intercept_ for est in fitd.estimators_], axis=0)
        else:
            # each estimator saw only a subset of features, so pad the results
            # into an (n_features, n_estimators) matrix before averaging
            # (X.shape[1] avoids the n_features_ attribute, which newer
            # sklearn versions no longer expose)
            coefsImports = np.full((X.shape[1], self.n_estimators), np.nan)
            if hasattr(fitd.estimators_[0], 'feature_importances_'):
                # store the feature importances
                for idx, thisEstim in enumerate(fitd.estimators_):
                    coefsImports[fitd.estimators_features_[idx], idx] = \
                        thisEstim.feature_importances_
                # average, ignoring features an estimator did not see
                self.feature_importances_ = np.nanmean(coefsImports, axis=1)
            else:
                # store the coefficients & intercepts
                self.intercept_ = 0
                for idx, thisEstim in enumerate(fitd.estimators_):
                    coefsImports[fitd.estimators_features_[idx], idx] = thisEstim.coef_
                    self.intercept_ += thisEstim.intercept_
                # average the intercepts and coefficients
                self.intercept_ /= self.n_estimators
                self.coef_ = np.nanmean(coefsImports, axis=1)
        return fitd

This correctly handles the case where max_features != 1.0, though I suspect it won't work exactly right if bootstrap_features=True.
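A minimal usage sketch of the subclass above; the dataset, base estimator and max_features value are my own choices for illustration, not from the original post:

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# only half of the features are offered to each estimator
reg = myBaggingRegressor(DecisionTreeRegressor(), max_features=0.5)
reg.fit(X, y)
print(reg.feature_importances_)  # one averaged importance per input feature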

I suppose it's because sklearn has evolved a lot since 2017, but I couldn't get it to work by overriding the constructor, and that doesn't seem entirely necessary anyway: the only reason to do so would be to pre-specify the feature_importances_ attribute as None, and it shouldn't even exist until fit() is called.

answered Sep 29 '22 by Dr. Andrew