
k-fold stratified cross-validation with imbalanced classes


I have data with 4 classes and I am trying to build a classifier. I have ~1000 vectors for one class, ~10^4 for another, ~10^5 for the third and ~10^6 for the fourth. I was hoping to use cross-validation so I looked at the scikit-learn docs.

My first try was to use StratifiedShuffleSplit but this gives the same percentage for each class, leaving the classes drastically imbalanced still.

Is there a way to do cross-validation but with the classes balanced in the training and test sets?


As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.

asked Sep 16 '15 by graffe




2 Answers

My first try was to use StratifiedShuffleSplit but this gives the same percentage for each class, leaving the classes drastically imbalanced still.

I get the feeling that you're confusing what a stratified strategy does, but you'll need to show your code and your results to say for sure what's going on. When you say "the same percentage for each class", do you mean each class keeps the percentage it has in the original set, or that all classes get equal percentages in the returned train/test sets? The former is how stratification is supposed to work.

As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.

One of these should definitely work. The description of the first one is a little confusing, but here's what they do.

StratifiedShuffleSplit

Provides train/test indices to split data in train/test sets.

This means that it splits your data into a train and test set. The stratified part means that percentages will be maintained in this split. So if 10% of your data is in class 1 and 90% is in class 2, this will ensure that 10% of your train set will be in class 1 and 90% will be in class 2. Same for the test set.

Your post makes it sound like you'd want 50% of each class in the test set. That isn't what stratification does; stratification maintains the original percentages. You should maintain them, because otherwise you'll give yourself an irrelevant idea of your classifier's performance: who cares how well it classifies a 50/50 split, when in practice you'll see 10/90 splits?
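
As a quick illustration (not from the original post; the 10/90 labels and dummy features below are made up), here's roughly what that looks like with scikit-learn's current model_selection API:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([1] * 10 + [2] * 90)  # made-up 10/90 imbalanced labels
X = np.zeros((100, 1))             # dummy features

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # both sides keep the original 10/90 ratio: 8/72 in train, 2/18 in test
    print("train:", np.bincount(y[train_idx])[1:],
          "test:", np.bincount(y[test_idx])[1:])

Every split keeps 10% of class 1 on both sides, which is exactly the behaviour described above.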

StratifiedKFold

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

See k-fold cross validation. Without stratification, it just splits your data into k folds. Then, each fold 1 <= i <= k is used once as the test set, while the others are used for training. The results are averaged in the end. It's similar to running the ShuffleSplit k times.

Stratification will ensure that the percentages of each class in your entire data will be the same (or very close to) within each individual fold.
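
Here's a similar hedged sketch for StratifiedKFold, again with made-up 10/90 labels; unlike the shuffle split, it partitions the whole data set into k disjoint folds:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 10 + [2] * 90)  # same made-up 10/90 labels
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # every test fold holds 2 samples of class 1 and 18 of class 2
    print("test fold counts:", np.bincount(y[test_idx])[1:])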


There is a lot of literature that deals with imbalanced classes. Some simple-to-use methods involve class weights and analyzing the ROC curve; a short class-weight sketch follows the list below. I suggest the following resources as starting points:

  1. A scikit-learn example of using class weights.
  2. A quora question about implementing neural networks for imbalanced data.
  3. This stats.stackexchange question with more in-depth answers.
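
To make the class-weights suggestion concrete, here's a rough sketch (the data set is synthetic, generated only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# a synthetic ~90/10 imbalanced binary problem
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency,
# so the minority class isn't drowned out during fitting
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# ROC AUC is insensitive to the class ratio (scored on the training
# data here, just to show the API)
print("ROC AUC:", roc_auc_score(y, clf.decision_function(X)))
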
answered by IVlad


K-Fold CV

K-fold CV works by randomly partitioning your data into k (fairly) equal partitions. If your data were evenly balanced across classes, like [0,1,0,1,0,1,0,1,0,1], randomly sampling with (or without) replacement would give you approximately equal sample sizes of 0s and 1s.

However, if your data is more like [0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0], where one class dominates the data, k-fold CV without weighted sampling would give you erroneous results.

If you use ordinary k-fold CV without adjusting the uniform sampling, you'd obtain something like

import numpy as np

# the labels from above, sorted by class: 27 zeros followed by 5 ones
y = np.concatenate((np.zeros(27, dtype=int), np.ones(5, dtype=int)))

## naive k-fold CV: k contiguous, unshuffled chunks
k = 5
splits = np.array_split(y, k)
print(splits)

[array([0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0]),
 array([0, 1, 1, 1, 1, 1])]

where there are clearly splits without useful representation of both classes.

The point of k-fold CV is to train/test a model across all subsets of data, while at each trial leaving out 1 subset and training on k-1 subsets.

In this scenario, you'd want to split by strata. In the above data set, there are 27 0s and 5 1s. If you'd like to compute k=5 CV, it wouldn't be reasonable to split the stratum of 1s into 5 subsets. A better solution is to split it into fewer subsets, such as 2, while the stratum of 0s can keep k=5 splits since it's much larger. Then while training, you'd take the 2 x 5 product of the splits from the data set. Here is some code to illustrate

from itertools import groupby, product

# groupby relies on y being sorted by class (as it is above);
# split the frequent class into 5 folds, the rare class into only 2
for strata, iterable in groupby(y):
    data = np.array(list(iterable))
    if strata == 0:
        zeros = np.array_split(data, 5)
    else:
        ones = np.array_split(data, 2)

cv_splits = list(product(zeros, ones))
print(cv_splits)
print(cv_splits)

m = len(cv_splits)  # 2 x 5 = 10 train/test trials in total

# for each pair (i, j), test on ones[i] + zeros[j] and train on all
# the remaining splits, so every partition is left out exactly once
for i in range(2):
    for j in range(5):
        train = np.concatenate(
            [s for idx, s in enumerate(ones) if idx != i]
            + [s for idx, s in enumerate(zeros) if idx != j]
        )
        print("Leave out ONES split {}, and Leave out ZEROS split {}".format(i, j))
        print("train on: ", train)
        print("test on: ", np.concatenate((ones[i], zeros[j])))



Leave out ONES split 0, and Leave out ZEROS split 0
train on:  [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test on:  [1 1 1 0 0 0 0 0 0]
Leave out ONES split 0, and Leave out ZEROS split 1
train on:  [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
...
Leave out ONES split 1, and Leave out ZEROS split 4
train on:  [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test on:  [1 1 0 0 0 0 0]

This method splits the data into partitions where every partition is eventually left out for testing. Note that not all statistical learning methods allow for class weighting, so adjusting the CV procedure like this is essential to account for the sampling proportions.

  • Reference: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R.
answered by Jon