Dealing with the class imbalance in binary classification

Question

Here's a brief description of my problem:

I am working on a supervised learning task to train a binary classifier.
I have a dataset with a large class imbalance: 8 negative instances for every positive one.
I use the F-measure, i.e. the harmonic mean of precision and recall (sensitivity), to assess the performance of a classifier.

I plot the ROC curves of several classifiers and all show a high AUC, suggesting that the classification is good. However, when I test the classifier and compute the F-measure, I get a really low value. I know that this issue is caused by the class skew of the dataset and, so far, I have found two options to deal with it:

Adopting a cost-sensitive approach by assigning weights to the dataset's instances (see this post)
Thresholding the predicted probabilities returned by the classifiers, to reduce the number of false positives and false negatives.

I went for the first option and that solved my issue (the F-measure is now satisfactory). But now my question is: which of these methods is preferable? And what are the differences?

P.S: I am using Python with the scikit-learn library.

cdeterman · Accepted Answer

Both weighting and thresholding are valid forms of cost-sensitive learning. In the briefest terms, you can think of the two as follows:

Weighting

Essentially, one asserts that the ‘cost’ of misclassifying the rare class is higher than that of misclassifying the common class. This is applied at the algorithmic level in algorithms such as SVM, ANN, and Random Forest. The limitation here is whether the algorithm can deal with weights. Furthermore, many applications of this aim to avoid a more serious misclassification (e.g. classifying someone who has pancreatic cancer as not having cancer). In such circumstances, you know why you want to make sure you classify specific classes correctly even in imbalanced settings. Ideally, you want to optimize the cost parameters as you would any other model parameter.
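As a rough illustration of weighting in scikit-learn, here is a minimal sketch using the class_weight parameter. The synthetic data (about 8 negatives per positive, to mirror the question) and the choice of LogisticRegression are my assumptions, not something taken from the asker's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): roughly 8 negatives per positive.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[8 / 9, 1 / 9], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Unweighted baseline: every misclassification costs the same.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on the rare class are penalised more heavily during fitting.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"recall on the rare class: plain={recall_plain:.2f}, "
      f"weighted={recall_weighted:.2f}")
```

Instead of "balanced" you can pass an explicit dict such as {0: 1, 1: 8} and tune the ratio like any other hyperparameter.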

Thresholding

If the algorithm returns probabilities (or some other score), thresholding can be applied after a model has been built. Essentially, you change the classification threshold from the default 0.5 to an appropriate trade-off level. This can typically be optimized by generating a curve of the evaluation metric (e.g. F-measure) against the threshold. The limitation here is that you are making absolute trade-offs: any modification of the cutoff will in turn decrease the accuracy of predicting the other class. If you have exceedingly high probabilities for the majority of your common classes (e.g. most above 0.85), you are more likely to have success with this method. It is also algorithm independent (provided the algorithm returns probabilities).
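A minimal sketch of the threshold sweep described above, assuming synthetic data and a LogisticRegression model (both are placeholders of mine). The cutoff should be chosen on a held-out validation split rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): roughly 8 negatives per positive.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[8 / 9, 1 / 9], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class

# Candidate cutoffs 0.05, 0.10, ..., 0.95 (0.5 is included exactly).
thresholds = np.arange(5, 100, 5) / 100.0
scores = [
    f1_score(y_val, (proba >= t).astype(int), zero_division=0)
    for t in thresholds
]

best_threshold = thresholds[int(np.argmax(scores))]
f1_best = max(scores)
f1_at_half = f1_score(y_val, (proba >= 0.5).astype(int), zero_division=0)
print(f"F1 at 0.5: {f1_at_half:.3f}; "
      f"best F1 {f1_best:.3f} at threshold {best_threshold:.2f}")
```

Plotting scores against thresholds gives the curve of the evaluation metric mentioned above.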

Sampling

Sampling is another common option applied to imbalanced datasets to bring some balance to the class distributions. There are essentially two fundamental approaches.

Under-sampling

Extract a smaller set of the majority instances and keep the minority. This will result in a smaller dataset where the distribution between classes is closer; however, you have discarded data that may have been valuable. This could also be beneficial if you have a very large amount of data.
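For illustration, random under-sampling can be sketched with plain NumPy. The toy arrays and the 1:1 target ratio are assumptions on my part; in practice you may want to keep more of the majority class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 800 negatives, 100 positives, 3 features.
y = np.array([0] * 800 + [1] * 100)
X = rng.normal(size=(900, 3))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep every minority row and an equally sized random subset
# of the majority rows, drawn without replacement.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))  # class counts after under-sampling: [100 100]
```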

Over-sampling

Increase the number of minority instances by replicating them. This will result in a larger dataset which retains all the original data but may introduce bias. As you increase the size, however, you may begin to impact computational performance as well.
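A minimal sketch of over-sampling by replication, using sklearn.utils.resample on toy data (the arrays and the 1:1 target ratio are assumptions of mine):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy data (assumption): 800 negatives, 100 positives, 3 features.
y = np.array([0] * 800 + [1] * 100)
X = rng.normal(size=(900, 3))

X_maj, X_min = X[y == 0], X[y == 1]

# Draw minority rows with replacement until they match the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_over = np.vstack([X_maj, X_min_up])
y_over = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])
print(np.bincount(y_over))  # class counts after over-sampling: [800 800]
```

Resample only the training split. If duplicated rows end up in both the training and test sets, the test scores will be overly optimistic.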

Advanced Methods

There are additional methods that are more ‘sophisticated’ to help address potential bias. These include methods such as SMOTE, SMOTEBoost and EasyEnsemble as referenced in this prior question regarding imbalanced datasets and CSL.

Model Building

One further note regarding building models with imbalanced data is that you should keep your model metric in mind. For example, metrics such as the F-measure don’t take the true negative rate into account. Therefore, it is often recommended to use metrics such as Cohen’s kappa in imbalanced settings.
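To see why a chance-corrected metric helps, here is a small sketch with made-up labels: a degenerate classifier that always predicts the majority class on an 8:1 dataset. The labels are invented for illustration only.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Made-up labels (assumption): 80 negatives, 10 positives.
y_true = [0] * 80 + [1] * 10

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * 90

accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)

# Accuracy looks good (80/90) although no positive is ever found;
# kappa is 0 because the agreement is no better than chance.
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```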

Noam Bressler · Answer

Before trying to solve the problem (and I think @cdeterman's answer covers that thoroughly), it's best to first define measures.

Apart from "all-in-one" metrics like Cohen's kappa, I find it extremely useful to just compute common metrics (such as precision, recall and F-measure) for each of the classes in the problem. Scikit-learn's classification_report does that quite conveniently:

from sklearn.metrics import classification_report
print(classification_report(test_df['target'], model.predict(test_df[features])))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      2640
           1       0.94      0.73      0.82        84

    accuracy                           0.99      2724
   macro avg       0.96      0.86      0.91      2724
weighted avg       0.99      0.99      0.99      2724

If you want a more visual output, you can use one of the Deepchecks built-in checks (disclosure: I'm one of the maintainers):

from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='target'), Dataset(test_df, label='target'), model)

Using such per-class metrics would have alerted you from the very beginning that your model is under-performing on certain classes (and on which ones). Running it again after using some cost-sensitive learning would let you know if you managed to balance out your performance between classes.
