I have a highly imbalanced dataset and want to perform binary classification. While reading some posts I found that sklearn provides class_weight="balanced" for imbalanced datasets, so my classifier code is as follows:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
Then I performed 10-fold cross validation with this classifier:
from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
new_scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
print(new_scores.mean())
However, I am not sure whether class_weight="balanced" is actually applied during the 10-fold cross validation. Am I doing it wrong? If so, is there a better way of doing this in sklearn? I am happy to provide more details if needed.
Instead of general cross validation, you might want to use stratified cross validation. More specifically, you can use StratifiedKFold instead of KFold in your code. This makes sure that the class imbalance is preserved in every potential train and test split. Your class_weight="balanced" setting is fine, by the way: cross_val_score clones and refits the classifier on each fold, so the weighting is applied every time.
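A minimal sketch of the swap, using a synthetic imbalanced dataset from make_classification as a stand-in for your X and y (the 9:1 class ratio is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced data: ~90% majority class, ~10% minority class
X, y = make_classification(
    n_samples=500, weights=[0.9, 0.1], random_state=42
)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

# StratifiedKFold keeps the class ratio roughly the same in every fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=skf, n_jobs=1)
print(scores.mean())
```

Note that StratifiedKFold.split also needs y (not just X) to stratify on; cross_val_score handles that for you when you pass cv=skf.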