For a project I am comparing a number of tree-based regression ensembles (Random Forest, Extra Trees, AdaBoost and Bagging) from scikit-learn. To compare and interpret them I use the feature importances, though for the bagging ensemble this does not seem to be available.
My question: Does anybody know how to get the feature importances list for Bagging?
Greetings, Kornee
The higher the value, the more important the feature. For each decision tree, scikit-learn calculates a node's importance using Gini importance, assuming only two child nodes (binary tree):

ni_j = w_j * C_j - w_left(j) * C_left(j) - w_right(j) * C_right(j)

where ni_j is the importance of node j, w_j is the weighted number of samples reaching node j, C_j is the impurity of node j, and left(j) / right(j) are the two children produced by the split at node j.
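For illustration, here is a minimal sketch of that node-importance calculation applied to a single fitted tree, using the arrays scikit-learn exposes on tree_ (a re-implementation for clarity, not the library's actual code path):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

# ni_j = w_j * C_j - w_left(j) * C_left(j) - w_right(j) * C_right(j)
left, right = t.children_left, t.children_right
w = t.weighted_n_node_samples / t.weighted_n_node_samples[0]  # weighted share of samples per node
node_importance = np.zeros(t.node_count)
for j in range(t.node_count):
    if left[j] == -1:                     # leaf node: no split, no importance
        continue
    node_importance[j] = (w[j] * t.impurity[j]
                          - w[left[j]] * t.impurity[left[j]]
                          - w[right[j]] * t.impurity[right[j]])

# sum node importances per splitting feature, then normalise
feat_importance = np.zeros(X.shape[1])
for j in range(t.node_count):
    if left[j] != -1:
        feat_importance[t.feature[j]] += node_importance[j]
feat_importance /= feat_importance.sum()

print(feat_importance)
print(tree.feature_importances_)          # should match the manual calculation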
Random forest is a supervised machine learning algorithm based on ensemble learning and an evolution of Breiman's original bagging algorithm. It is an improvement over plain bagged decision trees: it builds multiple decision trees and aggregates their predictions to get a more accurate result.
Logistic regression feature importance: the fitted coefficients can provide the basis for a crude feature importance score, assuming the input variables have the same scale or were scaled before fitting the model.
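As a minimal sketch of that idea (the scaling step and the averaging over classes are just one crude choice, not a prescribed recipe):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # put the features on the same scale first

model = LogisticRegression(max_iter=1000).fit(X_scaled, y)
# absolute coefficient size as a crude per-feature importance score
importance = np.mean(np.abs(model.coef_), axis=0)   # average over the classes (multiclass)
print(importance)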
Are you talking about BaggingClassifier? It can be used with many different base estimators, so feature importances are not implemented for it. There are model-independent methods for computing feature importances (see e.g. https://github.com/scikit-learn/scikit-learn/issues/8898), but scikit-learn doesn't use them.
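Note: newer scikit-learn versions (0.22 and later, if I remember correctly) do ship one such model-independent method as sklearn.inspection.permutation_importance. A minimal, self-contained sketch, assuming a recent version:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier()).fit(X, y)

# shuffle each feature column in turn and measure how much the score drops
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # one value per feature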
If the base estimators are decision trees, you can compute the feature importances yourself: it's just the average of tree.feature_importances_ over all trees in bagging.estimators_:
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier())
clf.fit(X, y)

# average the per-tree importances over all fitted trees
feature_importances = np.mean([
    tree.feature_importances_ for tree in clf.estimators_
], axis=0)
RandomForestClassifier does the same computation internally.
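Continuing from the snippet above, you can compare against a random forest fitted on the same data (the numbers won't match exactly because the two ensembles are randomized differently, but they are produced the same way):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0).fit(X, y)
print(rf.feature_importances_)   # averaged over its trees internally
print(feature_importances)       # the manual average from the bagging ensemble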
Extending what CharlesG posted, here's my solution for subclassing BaggingRegressor (the same should work for BaggingClassifier).
import numpy as np
from sklearn.ensemble import BaggingRegressor


class myBaggingRegressor(BaggingRegressor):
    def fit(self, X, y):
        fitd = super().fit(X, y)
        if self.max_features == 1.0:
            # every estimator saw every feature, so a plain average works
            if hasattr(fitd.estimators_[0], 'feature_importances_'):
                self.feature_importances_ = np.mean(
                    [est.feature_importances_ for est in fitd.estimators_], axis=0)
            else:
                self.coef_ = np.mean(
                    [est.coef_ for est in fitd.estimators_], axis=0)
                self.intercept_ = np.mean(
                    [est.intercept_ for est in fitd.estimators_], axis=0)
        else:
            # each estimator only saw a subset of the features, so pad the
            # results into an (n_features, n_estimators) array filled with NaN
            coefsImports = np.full((X.shape[1], self.n_estimators), np.nan)
            if hasattr(fitd.estimators_[0], 'feature_importances_'):
                # store each estimator's importances at the features it used
                for idx, thisEstim in enumerate(fitd.estimators_):
                    coefsImports[fitd.estimators_features_[idx], idx] = \
                        thisEstim.feature_importances_
                # average, ignoring the features an estimator never saw
                self.feature_importances_ = np.nanmean(coefsImports, axis=1)
            else:
                # store the coefficients & intercepts
                self.intercept_ = 0
                for idx, thisEstim in enumerate(fitd.estimators_):
                    coefsImports[fitd.estimators_features_[idx], idx] = thisEstim.coef_
                    self.intercept_ += thisEstim.intercept_
                # average the intercepts and the padded coefficients
                self.intercept_ /= self.n_estimators
                self.coef_ = np.nanmean(coefsImports, axis=1)
        return fitd
This correctly handles max_features != 1.0, though I suppose it won't work exactly right if bootstrap_features=True.
I suppose it's because sklearn has evolved a lot since 2017, but I couldn't get this to work with an overridden constructor, and that doesn't seem necessary anyway: the only reason to override __init__ would be to pre-define the feature_importances_ attribute as None, and that attribute shouldn't exist until fit() has been called.
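For completeness, a quick usage sketch (the diabetes dataset is just an arbitrary example; it assumes the class definition above):

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# each estimator sees only half of the features, so the padded/NaN branch is exercised
reg = myBaggingRegressor(DecisionTreeRegressor(), n_estimators=50, max_features=0.5)
reg.fit(X, y)
print(reg.feature_importances_)   # NaN-aware average over the estimators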