I have a highly imbalanced dataset and want to perform binary classification. While reading some posts I found that sklearn provides class_weight="balanced" for imbalanced datasets, so my classifier code is as follows:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
Then I performed 10-fold cross validation with this classifier:
from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
new_scores = cross_val_score(clf, X, y, cv=k_fold, n_jobs=1)
print(new_scores.mean())
However, I am not sure whether class_weight="balanced" is actually applied during the 10-fold cross validation. Am I doing it wrong? If so, is there a better way of doing this in sklearn? I am happy to provide more details if needed.
Instead of general cross validation, you might want to use stratified cross validation. More specifically, you can use StratifiedKFold instead of KFold in your code. This makes sure that the class imbalance is preserved in every potential train and test split. Your class_weight="balanced" setting is fine, by the way: cross_val_score clones and refits the classifier on each fold, so the weighting is applied every time.
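A minimal sketch of the swap, using a synthetic imbalanced dataset from make_classification as a stand-in for your X and y (the 9:1 class ratio is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced data: ~90% majority class, ~10% minority class
X, y = make_classification(
    n_samples=500, weights=[0.9, 0.1], random_state=42
)

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

# StratifiedKFold keeps the class ratio roughly the same in every fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=skf, n_jobs=1)
print(scores.mean())
```

Note that StratifiedKFold.split also needs y (not just X) to stratify on; cross_val_score handles that for you when you pass cv=skf.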