 

Why don't classification models work in a class-imbalanced setting?

There are many posts and resources on how to combat a class imbalance problem, namely over-sampling the minority class or under-sampling the majority class.

I also understand that using accuracy to evaluate the model performance on an imbalanced problem would be wrong.

However, I didn't find many resources talking about why ML models fail on class-imbalanced problems in the first place. Is it simply because the loss function is usually a sum over all the data points, so the model tends to put more emphasis on the majority class and little on the minority class?
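For intuition, here is a minimal sketch (with synthetic, hypothetical data) of the effect I mean: with a plain per-sample average loss and a ~2% positive rate, even a constant "predict the base rate" model already achieves a low loss and a high accuracy.

```python
import numpy as np

# Synthetic, hypothetical data: ~2% positive labels.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.02).astype(int)

# A trivial model that always predicts the 2% base rate.
p_trivial = np.full(len(y), 0.02)

# Log loss is a plain mean over individual points, so the 98% majority
# dominates it, and the trivial model already looks good.
eps = 1e-12
log_loss = -np.mean(y * np.log(p_trivial + eps)
                    + (1 - y) * np.log(1 - p_trivial + eps))
print(f"log loss of the constant predictor: {log_loss:.3f}")   # ~0.10

# Thresholding at 0.5 predicts "negative" everywhere: ~98% accuracy.
print(f"accuracy of 'always negative': {np.mean((p_trivial >= 0.5) == y):.3f}")
```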

Second, in real applications such as fraud detection or click prediction (where class imbalance occurs naturally), why would changing the distribution of the training set by over- or under-sampling be a good thing to do? Wouldn't we want the classifier to reflect the real distribution, which is imbalanced by nature? Say I have a logistic regression model trained to predict fraud, and assume that the fraud rate is 2%. Over-sampling the fraud events essentially tells the model that the fraud rate is not 2% but 50% (say). Is that a good thing to do?

To summarize. Two questions:

  1. Why would ML models fail in a class-imbalanced setting? Is it because the loss function is usually a sum of the losses of individual data points?

  2. Why is over- or under-sampling, which essentially changes how the model sees the problem, a good approach? Why not let the model truthfully reflect the distribution of the classes?

asked Feb 07 '18 by Jing

2 Answers

TL;DR: the "curse" of class imbalance is kind of a myth, relevant only for certain types of problems.

  1. Not all ML models fail in a class-imbalanced setting. Most probabilistic models are not seriously affected by class imbalance. Problems usually arise when we switch to non-probabilistic prediction or to multiclass problems.

In logistic regression (and its generalization, neural networks), class imbalance strongly affects the intercept, but has very little influence on the slope coefficients. Intuitively, the predicted log-odds log(p(y=1)/p(y=0)) = a + x*b from binary logistic regression changes by a fixed amount when we change the prior probabilities of the classes, and this shift is absorbed by the intercept a.
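A minimal sketch of this effect on synthetic (hypothetical) data: fitting scikit-learn's LogisticRegression on the imbalanced sample and on an oversampled, balanced copy shifts the intercept by roughly the log of the prior-odds ratio, while the slope barely moves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical 1-D data: ~2% positives, shifted by one unit.
rng = np.random.default_rng(0)
n_pos, n_neg = 200, 9_800
X = np.concatenate([rng.normal(1.0, 1.0, n_pos),
                    rng.normal(0.0, 1.0, n_neg)]).reshape(-1, 1)
y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

clf = LogisticRegression().fit(X, y)

# Oversample the positives 49x so the classes become roughly balanced.
idx = np.concatenate([np.tile(np.arange(n_pos), 49),
                      np.arange(n_pos, n_pos + n_neg)])
clf_bal = LogisticRegression().fit(X[idx], y[idx])

print("slope:    ", clf.coef_[0][0], "vs", clf_bal.coef_[0][0])      # ~unchanged
print("intercept:", clf.intercept_[0], "vs", clf_bal.intercept_[0])  # shifted
print("expected shift ~ log(49) =", np.log(49))                      # prior ratio
```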

In decision trees (and their generalizations, random forests and gradient-boosted trees), class imbalance affects the leaf impurity metrics, but this effect is roughly equal for all candidate splits, so it usually does not affect the choice of splits much.

On the other hand, non-probabilistic models like SVM can be seriously affected by class imbalance. An SVM learns its separating hyperplane in such a way that roughly the same number of positive and negative examples (the support vectors) lie on the border or on its wrong side. Therefore, resampling can dramatically change these numbers and hence the position of the border.
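This is also why most SVM implementations offer class weights as a lighter-weight alternative to resampling. A minimal sketch on synthetic (hypothetical) 1-D data with scikit-learn's LinearSVC:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic, hypothetical 1-D data: ~2% positives, shifted by one unit.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(1.0, 1.0, 200),
                    rng.normal(0.0, 1.0, 9_800)]).reshape(-1, 1)
y = np.concatenate([np.ones(200), np.zeros(9_800)])

svm_raw = LinearSVC().fit(X, y)
svm_wtd = LinearSVC(class_weight="balanced").fit(X, y)  # reweight, don't resample

# In 1-D, the boundary is where the decision function crosses zero: x = -b / w.
print("raw boundary:     ", -svm_raw.intercept_[0] / svm_raw.coef_[0][0])
print("weighted boundary:", -svm_wtd.intercept_[0] / svm_wtd.coef_[0][0])
```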

When we use probabilistic models for binary classification, everything is OK: at training time, models don't depend much on imbalance, and for testing we can use imbalance-insensitive metrics like ROC AUC, which depend on the predicted class probabilities rather than on "hard" discrete classification.

However, these metrics do not generalize easily to multiclass classification, where we usually fall back on simple accuracy. And accuracy has known issues with class imbalance: it is based on hard classification, which may completely ignore the rare classes. This is the point at which most practitioners turn to oversampling. However, if you stick to probabilistic prediction and measure performance with log loss (aka cross-entropy), you can still survive class imbalance.
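For the binary case, a minimal sketch of this style of evaluation on synthetic (hypothetical) data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

# Synthetic, hypothetical 1-D data: ~2% positives.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(1.0, 1.0, 200),
                    rng.normal(0.0, 1.0, 9_800)]).reshape(-1, 1)
y = np.concatenate([np.ones(200), np.zeros(9_800)])

proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

print("ROC AUC :", roc_auc_score(y, proba))       # rank-based, prior-insensitive
print("log loss:", log_loss(y, proba))            # proper scoring rule
print("accuracy:", np.mean((proba >= 0.5) == y))  # misleadingly high here
```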

  2. Over-sampling is good when you don't want probabilistic classification. In this case, the "distribution of the classes" is kind of irrelevant.

Imagine an application where you don't need the probability that there is a cat in the picture; you just want to know whether the image is more similar to images of cats than to images of dogs. In this setting, it may be desirable that cats and dogs get an equal number of "votes", even if cats were the majority in the original training sample.

In other applications (like fraud detection, click prediction, or my favorite, credit scoring), what you really need is not a "hard" classification but a ranking: which customers are more likely to cheat, click, or default than the others? In this case, it does not matter whether the sample is imbalanced, because the cutoff is usually set by hand (from economic considerations, such as cost analysis). However, in such applications it may be helpful to predict the "true" probability of fraud (or click, or default), and upsampling is thus unwanted.
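A minimal sketch of this ranking workflow on synthetic (hypothetical) data; the review budget below is an arbitrary stand-in for a real cost analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical 1-D data: ~2% positives (the "fraudsters").
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(1.0, 1.0, 200),
                    rng.normal(0.0, 1.0, 9_800)]).reshape(-1, 1)
y = np.concatenate([np.ones(200), np.zeros(9_800)])

# Train on the untouched, imbalanced data and use the scores only to rank.
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

review_budget = 100                       # hypothetical: cases we can inspect
flagged = np.argsort(scores)[::-1][:review_budget]
print("fraud rate among flagged cases:", y[flagged].mean())  # well above 2%
```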

answered Oct 15 '22 by David Dale


Let me explain your questions with an example:

Suppose there is a dataset consisting of the following transactions:

  1. 10000 genuine
  2. 10 fraudulent

The classifier will tend to classify fraudulent transactions as genuine transactions.

Suppose the machine learning algorithm has two possible outputs, as follows:

Model 1

Classified transactions:

  1. 10 out of 10000 genuine as fraudulent
  2. 7 out of 10 fraudulent as genuine

Model 2

Classified transactions:

  1. 100 out of 10000 genuine as fraudulent
  2. 2 out of 10 fraudulent as genuine

If the classifier’s performance is determined by the number of mistakes:

Model 1: 17 mistakes.

Model 2: 102 mistakes.

Model 1 is better.

However, as we want to minimize the number of fraudulent happening:

Model 1: 7 mistakes.

Model 2: 2 mistakes.

Model 2 is better.

Answer to question 1: A general machine learning algorithm will just pick Model 1 over Model 2, which is a problem. In practice, this means we will let a lot of fraudulent transactions go through, although we could have stopped them by using Model 2.
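A quick sketch that restates these counts in terms of the standard metrics, showing accuracy and recall pulling in opposite directions:

```python
# Confusion-matrix arithmetic for the two hypothetical models above.
def metrics(fp, fn, n_neg=10_000, n_pos=10):
    tp, tn = n_pos - fn, n_neg - fp
    accuracy = (tp + tn) / (n_neg + n_pos)
    recall = tp / n_pos                      # fraction of frauds actually caught
    return accuracy, recall

for name, fp, fn in [("Model 1", 10, 7), ("Model 2", 100, 2)]:
    acc, rec = metrics(fp, fn)
    print(f"{name}: accuracy={acc:.4f}, recall={rec:.2f}")

# Model 1: accuracy=0.9983, recall=0.30   <- wins on accuracy
# Model 2: accuracy=0.9898, recall=0.80   <- wins on catching fraud
```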

Undersampling

By undersampling, we risk removing some majority-class instances that are more representative, thus discarding useful information.

This can be illustrated as follows:

[Figure: ideal (green) vs. learned (blue) decision boundary, with and without undersampling]

Green line: the ideal decision boundary we would like to have. Blue line: the actual result.

Left diagram: result of just applying a general machine learning algorithm without undersampling.

Right diagram: result of just applying a general machine learning algorithm with undersampling.

Oversampling

With oversampling, merely duplicating the minority class can lead the classifier to overfit to a few examples, as illustrated below:

[Figure: ideal (green) vs. learned (blue) decision boundary, with and without oversampling]

Left diagram: result of just applying a general machine learning algorithm without oversampling.

Right diagram: result of just applying a general machine learning algorithm with oversampling.

Answer to question 2: With undersampling, we reduced the negative class but removed some informative negative examples, which slanted the blue decision boundary and caused some negative examples to be wrongly classified as positive.

With oversampling, the thick positive signs indicate that there are multiple repeated copies of that data instance. The machine learning algorithm then sees these cases many times and therefore tends to overfit to these specific examples, resulting in the blue boundary shown above.
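For completeness, a minimal sketch of both resampling strategies in plain NumPy on synthetic (hypothetical) data; libraries such as imbalanced-learn wrap the same idea behind a fit_resample interface:

```python
import numpy as np

# Synthetic, hypothetical data: ~2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 2))
y = (rng.random(1_000) < 0.02).astype(int)

pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

# Undersampling: keep only as many negatives as there are positives.
under = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

# Oversampling: duplicate positives (with replacement) up to the negative count.
over = np.concatenate([neg, rng.choice(pos, size=len(neg), replace=True)])

print("undersampled class counts:", np.bincount(y[under]))
print("oversampled class counts: ", np.bincount(y[over]))
```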

For more information please see this.

answered Oct 15 '22 by Ali Soltani