
How to calculate feature importance in each model of cross validation in sklearn

I am using RandomForestClassifier() with 10-fold cross validation as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring='accuracy')
print(accuracy.mean())

I want to identify the important features in my feature space. It seems straightforward to get the feature importance for a single fitted classifier as follows.

print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)

However, I could not find how to get the feature importances for each fold of cross validation in sklearn.

In summary, I want to identify the most effective features (e.g., by using an average importance score) across the 10 folds of cross validation.

I am happy to provide more details if needed.

asked Apr 02 '19 by EmJ

People also ask

How is feature importance calculated in Sklearn?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
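
As a rough sketch of that formula (the notation here is an assumption for illustration, not sklearn's documented API): if N is the total number of samples and N_t, N_tL, N_tR are the samples reaching node t and its left/right children, the impurity decrease contributed by one split is

\Delta I(t) = \frac{N_t}{N}\Big(\mathrm{impurity}(t) - \frac{N_{t_L}}{N_t}\,\mathrm{impurity}(t_L) - \frac{N_{t_R}}{N_t}\,\mathrm{impurity}(t_R)\Big)

and a feature's importance is the sum of \Delta I(t) over all nodes t that split on that feature, typically normalized so all importances sum to 1.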

How do you measure feature importance?

The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
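
A minimal sketch of this idea using sklearn.inspection.permutation_importance (available in newer scikit-learn releases; the dataset, split, and n_repeats below are just illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Fit on a training split, then measure permutation importance on a held-out split
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Each feature is shuffled n_repeats times; the drop in score is its importance
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
ranking = sorted(zip(X.columns, result.importances_mean), key=lambda t: t[1], reverse=True)
for name, score in ranking[:5]:
    print(f"{name}: {score:.4f}")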

How do models like random forest determine feature importance?

Random Forest has a built-in feature importance. A Random Forest is a set of Decision Trees, and each Decision Tree is a set of internal nodes and leaves. At each internal node, the selected feature is used to decide how to divide the data set into two separate sets with similar responses within each.
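
A small sketch of that aggregation (the dataset below is only an assumption): scikit-learn's forest-level feature_importances_ is, up to normalization, the mean of the per-tree impurity-based importances.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Average the impurity-based importances of the individual trees
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])
manual = per_tree.mean(axis=0)
manual /= manual.sum()   # normalize, as the forest does

print(np.allclose(manual, clf.feature_importances_))  # expected: True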

How do you determine feature importance in a decision tree?

How do we compute feature importance from decision trees? The basic idea for a specific feature is to take, at each node that splits on that feature, the impurity metric of the node and subtract the (weighted) impurity metric of its child nodes.
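
Below is a hedged sketch of that computation on a single fitted DecisionTreeClassifier, reading the public tree_ arrays directly; the dataset and max_depth are illustrative assumptions, and the result should match feature_importances_ after normalization.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

tree = dt.tree_
importances = np.zeros(X.shape[1])
n = tree.weighted_n_node_samples

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaf: no split, no contribution
        continue
    # weighted impurity decrease produced by this node's split
    decrease = (n[node] * tree.impurity[node]
                - n[left] * tree.impurity[left]
                - n[right] * tree.impurity[right])
    importances[tree.feature[node]] += decrease

importances /= importances.sum()  # normalize so the importances sum to 1
print(np.allclose(importances, dt.feature_importances_))  # expected: True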


1 Answer

cross_val_score() does not return the fitted estimators for the individual train-test splits.

You need to use cross_validate() and set return_estimator=True.

Here is a working example:

from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")

# return_estimator=True keeps the fitted estimator from every fold
output = cross_validate(clf, X, y, cv=2, scoring='accuracy', return_estimator=True)

for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_importances = pd.DataFrame(estimator.feature_importances_,
                                       index=diabetes.feature_names,
                                       columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importances)

Output:

Features sorted by their score for estimator 0:
     importance
s6     0.137735
age    0.130152
s5     0.114561
s2     0.113683
s3     0.112952
bmi    0.111057
bp     0.108682
s1     0.090763
s4     0.056805
sex    0.023609
Features sorted by their score for estimator 1:
     importance
age    0.129671
bmi    0.125706
s2     0.125304
s1     0.113903
bp     0.111979
s6     0.110505
s5     0.106099
s3     0.098392
s4     0.054542
sex    0.023900
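
To get the single averaged ranking the question asks for, one option (a sketch, assuming the output and diabetes objects from the example above) is to stack the per-fold importances and take their mean and standard deviation:

import numpy as np
import pandas as pd

# Assumes `output` and `diabetes` from the cross_validate example above
all_importances = np.array([est.feature_importances_ for est in output['estimator']])

mean_importances = pd.DataFrame({
    'mean_importance': all_importances.mean(axis=0),
    'std_importance': all_importances.std(axis=0),
}, index=diabetes.feature_names).sort_values('mean_importance', ascending=False)

print(mean_importances)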
answered Sep 28 '22 by Venkatachalam