I am using RandomForestClassifier() with 10-fold cross-validation as follows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring='accuracy')
print(accuracy.mean())
I want to identify the important features in my feature space. It seems straightforward to get the feature importances for a single fitted classifier, as follows.
print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
However, I could not find how to get the feature importances when using cross-validation in sklearn.
In summary, I want to identify the most effective features (e.g., by using an average importance score) across the 10 folds of cross-validation.
I am happy to provide more details if needed.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
The concept is really straightforward: we measure the importance of a feature by calculating the increase in the model's prediction error after permuting that feature's values. A feature is "important" if shuffling its values increases the model error, because in that case the model relied on the feature for the prediction.
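For illustration, here is a minimal sketch of this idea using sklearn's built-in permutation_importance helper; the iris dataset, the train/test split, and n_repeats=10 are illustrative choices, not part of the original post.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Fit on one split and measure importance on held-out data, so the score
# reflects generalization rather than memorization.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the mean drop in accuracy.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(i, result.importances_mean[i], "+/-", result.importances_std[i])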
Random Forest built-in feature importance: a random forest is a set of decision trees, and each decision tree is a set of internal nodes and leaves. At each internal node, the selected feature is used to decide how to divide the data set into two separate sets with similar responses within.
How do we compute feature importance from decision trees? The basic idea is, for each node that splits on a given feature, to subtract the (sample-weighted) impurity of its child nodes from the impurity of the node itself; summing these impurity decreases over all such nodes gives that feature's importance.
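To make that computation concrete, here is a minimal sketch that recomputes impurity-based importances by hand from a fitted tree's internals and checks them against sklearn's feature_importances_; the iris dataset and the single DecisionTreeClassifier are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

importances = np.zeros(X.shape[1])
total_weight = t.weighted_n_node_samples[0]
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, so no impurity decrease
        continue
    # Weighted impurity decrease: parent impurity minus children impurities,
    # each weighted by the number of samples reaching that node.
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease / total_weight

importances /= importances.sum()  # normalize so the importances sum to 1
print(np.allclose(importances, tree.feature_importances_))  # expect True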
cross_val_score() does not return the estimators fitted on each train-test fold. You need to use cross_validate() and set return_estimator=True.
Here is a working example:
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Note: load_diabetes is a regression dataset; it is used here only to
# demonstrate the mechanics of retrieving the per-fold estimators.
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
output = cross_validate(clf, X, y, cv=2, scoring='accuracy', return_estimator=True)

for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_importances = pd.DataFrame(estimator.feature_importances_,
                                       index=diabetes.feature_names,
                                       columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importances)
Output:
Features sorted by their score for estimator 0:
importance
s6 0.137735
age 0.130152
s5 0.114561
s2 0.113683
s3 0.112952
bmi 0.111057
bp 0.108682
s1 0.090763
s4 0.056805
sex 0.023609
Features sorted by their score for estimator 1:
importance
age 0.129671
bmi 0.125706
s2 0.125304
s1 0.113903
bp 0.111979
s6 0.110505
s5 0.106099
s3 0.098392
s4 0.054542
sex 0.023900
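To get the single ranking the question asks for, one possible follow-up (my suggestion, not part of the original answer) is to stack the per-fold importances and average them. This sketch continues the example above, reusing output, pd, and diabetes.

import numpy as np

# Stack each fold's feature_importances_ into a (n_folds, n_features) array
# and average across folds to get one overall ranking.
all_importances = np.stack([est.feature_importances_ for est in output['estimator']])
mean_importances = pd.Series(all_importances.mean(axis=0),
                             index=diabetes.feature_names).sort_values(ascending=False)
print(mean_importances)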