I am using RandomForestClassifier() with 10-fold cross-validation as follows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring='accuracy')
print(accuracy.mean())
I want to identify the important features in my feature space. It seems straightforward to get the feature importances for a single fitted classifier, as follows.
print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
However, I could not find how to get the feature importances when using cross-validation in sklearn.
In summary, I want to identify the most effective features (e.g., by using an average importance score) across the 10 folds of cross-validation.
I am happy to provide more details if needed.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
The concept is really straightforward: we measure the importance of a feature by calculating the increase in the model's prediction error after permuting that feature's values. A feature is "important" if shuffling its values increases the model error, because in that case the model relied on the feature for the prediction.
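For illustration, here is a minimal sketch of this idea using sklearn's built-in permutation_importance helper; the iris dataset, the train/test split, and n_repeats=10 are illustrative choices, not part of the original post.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Fit on one split and measure importance on held-out data, so the score
# reflects generalization rather than memorization.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the mean drop in accuracy.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(i, result.importances_mean[i], "+/-", result.importances_std[i])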
Random Forest built-in feature importance: a random forest is a set of decision trees, and each decision tree is a set of internal nodes and leaves. At each internal node, the selected feature is used to decide how to divide the data set into two separate sets with similar responses within.
How do we compute feature importance from decision trees? The basic idea is, for each node that splits on a given feature, to subtract the (sample-weighted) impurity of its child nodes from the impurity of the node itself; summing these impurity decreases over all such nodes gives that feature's importance.
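To make that computation concrete, here is a minimal sketch that recomputes impurity-based importances by hand from a fitted tree's internals and checks them against sklearn's feature_importances_; the iris dataset and the single DecisionTreeClassifier are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

importances = np.zeros(X.shape[1])
total_weight = t.weighted_n_node_samples[0]
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, so no impurity decrease
        continue
    # Weighted impurity decrease: parent impurity minus children impurities,
    # each weighted by the number of samples reaching that node.
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    importances[t.feature[node]] += decrease / total_weight

importances /= importances.sum()  # normalize so the importances sum to 1
print(np.allclose(importances, tree.feature_importances_))  # expect True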
cross_val_score() does not return the estimators fitted on each train-test fold. You need to use cross_validate() and set return_estimator=True.
Here is a working example:
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Note: load_diabetes is a regression dataset; it is used here only to
# demonstrate the mechanics of retrieving the per-fold estimators.
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
output = cross_validate(clf, X, y, cv=2, scoring='accuracy', return_estimator=True)

for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_importances = pd.DataFrame(estimator.feature_importances_,
                                       index=diabetes.feature_names,
                                       columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importances)
Output:
Features sorted by their score for estimator 0:
importance
s6 0.137735
age 0.130152
s5 0.114561
s2 0.113683
s3 0.112952
bmi 0.111057
bp 0.108682
s1 0.090763
s4 0.056805
sex 0.023609
Features sorted by their score for estimator 1:
importance
age 0.129671
bmi 0.125706
s2 0.125304
s1 0.113903
bp 0.111979
s6 0.110505
s5 0.106099
s3 0.098392
s4 0.054542
sex 0.023900
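To get the single ranking the question asks for, one possible follow-up (my suggestion, not part of the original answer) is to stack the per-fold importances and average them. This sketch continues the example above, reusing output, pd, and diabetes.

import numpy as np

# Stack each fold's feature_importances_ into a (n_folds, n_features) array
# and average across folds to get one overall ranking.
all_importances = np.stack([est.feature_importances_ for est in output['estimator']])
mean_importances = pd.Series(all_importances.mean(axis=0),
                             index=diabetes.feature_names).sort_values(ascending=False)
print(mean_importances)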