
Can I show feature importance for MultiOutputClassifier?

I'm trying to recover the feature importances of a MultiOutputClassifier that wraps a RandomForestClassifier.

Fitting the MultiOutputClassifier works without any problems:

import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

## Generate data
x, y = make_multilabel_classification(n_samples=1000, 
                                      n_features=15, 
                                      n_labels = 5, 
                                      n_classes=3, 
                                      random_state=12, 
                                      allow_unlabeled = True)
x_train = x[:700,:]
x_test  = x[700:,:]   # start at 700 so that sample 700 is not silently dropped
y_train = y[:700,:]
y_test  = y[700:,:]

## Generate model
forest = RandomForestClassifier(n_estimators = 100, random_state = 1)
multi_forest = MultiOutputClassifier(forest, n_jobs = -1).fit(x_train, y_train)

## Make prediction
dfOutput_multi_forest = multi_forest.predict_proba(x_test)

The prediction dfOutput_multi_forest does not show any problems, but I want to recover the feature importance of the multi_forest for interpretation of the output.

Using multi_forest.feature_importance_ throws the following error message: AttributeError: 'MultiOutputClassifier' object has no attribute 'feature_importance_'

Does anyone know how to retrieve the feature importances? I'm using scikit-learn v0.20.2.

PaulH asked Feb 06 '19 20:02



1 Answer

Indeed, sklearn's MultiOutputClassifier does not have an attribute that aggregates the feature importances of all the estimators (in your case, all the RandomForest classifiers) used in the model.

However, it is possible to access the feature importances of each RandomForest classifier, and then average them all together to give you each feature's average importance, across all RandomForest classifiers.

MultiOutputClassifier objects have an attribute called estimators_. If you run multi_forest.estimators_, you will get a list containing an object for each of your RandomForest classifiers.

For each of these RandomForest classifier objects, you can access its feature importances through the feature_importances_ attribute (note the plural spelling, unlike the feature_importance_ used in your question).
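As a quick check of the two attributes described above, the following sketch refits the model from the question and prints what estimators_ holds. It assumes the same synthetic data and split as the question and leaves n_jobs at its default; the printed shapes are what I would expect for 3 labels and 15 features, not output copied from the original post.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Same synthetic data as in the question: 15 features, 3 labels.
x, y = make_multilabel_classification(n_samples=1000,
                                      n_features=15,
                                      n_labels=5,
                                      n_classes=3,
                                      random_state=12,
                                      allow_unlabeled=True)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_forest = MultiOutputClassifier(forest).fit(x[:700], y[:700])

# estimators_ holds one fitted forest per column of y.
print(len(multi_forest.estimators_))

for i, clf in enumerate(multi_forest.estimators_):
    # Each forest carries its own importance vector, one value per feature.
    print(i, type(clf).__name__, clf.feature_importances_.shape)
```

The wrapper itself only stores the fitted per-label estimators, so any combined importance has to be computed from this list.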

To put it all together, this was my approach:

feat_impts = [] 
for clf in multi_forest.estimators_:
    feat_impts.append(clf.feature_importances_)

np.mean(feat_impts, axis=0)

I ran the example code you pasted into your question, and then ran the above block of code, which output the following array of 15 averages:

array([0.09830467, 0.0912088 , 0.05738045, 0.1211305 , 0.03901933,
       0.05429491, 0.06929378, 0.06404416, 0.05676634, 0.04919717,
       0.05244265, 0.0509295 , 0.05615341, 0.09202444, 0.04780991])

This array contains the average importance of each of your 15 features across the 3 random forest classifiers used in your MultiOutputClassifier.

This should at least help you see which features, on the whole, tended to be more important in making predictions across your 3 labels.
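Because averaging can hide a feature that matters a lot for one label and little for the others, it may also help to look at the per-label importances side by side. The sketch below is one possible way to do that with pandas. The feature_N and label_N names are hypothetical placeholders (the synthetic data has no real names), and n_jobs is left at its default.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Same synthetic data and training split as in the question.
x, y = make_multilabel_classification(n_samples=1000,
                                      n_features=15,
                                      n_labels=5,
                                      n_classes=3,
                                      random_state=12,
                                      allow_unlabeled=True)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_forest = MultiOutputClassifier(forest).fit(x[:700], y[:700])

# Stack the per-forest importances into an (n_features, n_labels) array.
importances = np.column_stack(
    [clf.feature_importances_ for clf in multi_forest.estimators_]
)

# Hypothetical placeholder names, used only to label rows and columns.
feature_names = [f"feature_{i}" for i in range(x.shape[1])]
label_names = [f"label_{j}" for j in range(y.shape[1])]

per_label = pd.DataFrame(importances, index=feature_names, columns=label_names)

# The "mean" column is the same average as np.mean(feat_impts, axis=0).
per_label["mean"] = per_label.mean(axis=1)

print(per_label.sort_values("mean", ascending=False).round(3))
```

Sorting by the mean column gives the overall ranking, while the individual label columns show whether that ranking holds for every output.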

James Dellinger answered Oct 07 '22 11:10