 

How to perform cross validation for imbalanced datasets in sklearn

I have a highly imbalanced dataset and I want to perform a binary classification.

While reading some posts, I found that sklearn provides class_weight="balanced" for imbalanced datasets. So my classifier code is as follows.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

Then I performed 10-fold cross validation using the above classifier, as follows.

from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
new_scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
print(new_scores.mean())

However, I am not sure if class_weight="balanced" is reflected through 10-fold cross validation. Am I doing it wrong? If so, is there any better way of doing this in sklearn?

I am happy to provide more details if needed.

EmJ asked Mar 30 '19
1 Answer

Instead of plain cross validation, you might want to use stratified cross validation. More specifically, you can use StratifiedKFold instead of KFold in your code.

This makes sure that the class proportions are preserved in every train/test split, so each fold reflects the imbalance of the full dataset.
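The change above can be sketched as follows. The synthetic imbalanced dataset from make_classification is only an illustration standing in for your X and y; the classifier and splitter parameters mirror those in your question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced data (~90/10 class split) standing in for your X, y
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

# StratifiedKFold keeps the class ratio roughly constant in each fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores = cross_val_score(clf, X, y, cv=skf, n_jobs=1)
print(scores.mean())
```

Note that class_weight="balanced" and stratified splitting address different things: the former reweights the loss inside each fit, while the latter ensures every fold sees both classes in realistic proportions, so it makes sense to use both together.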

Quickbeam2k1 answered Oct 21 '22