I have data with 4 classes and I am trying to build a classifier. I have ~1000 vectors for one class, ~10^4 for another, ~10^5 for the third and ~10^6 for the fourth. I was hoping to use cross-validation so I looked at the scikit-learn docs.
My first try was to use StratifiedShuffleSplit but this gives the same percentage for each class, leaving the classes drastically imbalanced still.
Is there a way to do cross-validation but with the classes balanced in the training and test set?
As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.
Sadly, plain k-fold cross-validation is not well suited to evaluating classifiers on imbalanced data, because a random fold can end up with very few (or no) examples of a rare class.
Stratified k-fold cross-validation: Stratified k-fold cross-validation addresses this problem. The dataset is partitioned into k groups or folds such that each fold contains approximately the same proportion of each target class label as the full dataset.
The stratified k fold cross-validation is an extension of the cross-validation technique used for classification problems. It maintains the same class ratio throughout the K folds as the ratio in the original dataset.
Implementing the concept of stratified sampling in cross-validation ensures the training and test sets have the same proportion of the feature of interest as in the original dataset. Doing this with the target variable ensures that the cross-validation result is a close approximation of generalization error.
My first try was to use StratifiedShuffleSplit but this gives the same percentage for each class, leaving the classes drastically imbalanced still.
I get the feeling that you're confused about what a stratified strategy does, but you'll need to show your code and your results for anyone to say for sure what's going on. Do you mean that each class keeps the same percentage it has in the original set, or that every class has the same percentage within the returned train / test set? The first one is how it's supposed to be.
As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.
One of these should definitely work. The description of the first one is definitely a little confusing, but here's what they do.
Provides train/test indices to split data in train test sets.
This means that it splits your data into a train and test set. The stratified part means that percentages will be maintained in this split. So if 10% of your data is in class 1 and 90% is in class 2, this will ensure that 10% of your train set will be in class 1 and 90% will be in class 2. Same for the test set.
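To make the 10% / 90% case concrete, here is a minimal sketch using StratifiedShuffleSplit. The toy labels, the 20% test size and the random seed are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 10% class 1, 90% class 2 (illustrative, not the asker's data).
y = np.array([1] * 100 + [2] * 900)
X = np.zeros((len(y), 1))  # the features play no role in how the split is made

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Share of class 1 on each side of the split.
train_frac = np.mean(y[train_idx] == 1)
test_frac = np.mean(y[test_idx] == 1)
print("class 1 share in train:", train_frac)
print("class 1 share in test: ", test_frac)
```

Both shares should come out at about 0.1, matching the share of class 1 in the full label array.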
Your post makes it sound like you'd want 50% of each class in the test set. That isn't what stratification does; stratification maintains the original percentages. You should maintain them, because otherwise you'll give yourself a misleading idea about the performance of your classifier: who cares how well it classified a 50/50 split, when in practice you'll see 10/90 splits?
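If balanced training data is still wanted, one option that leaves the evaluation realistic is to rebalance only the training indices inside each fold and keep the test fold at the original percentages. The sketch below uses random undersampling, which is a technique swapped in here for illustration and not something the answer above prescribes. The `undersample` helper is hypothetical (it is not part of scikit-learn), and the class sizes are a scaled-down stand-in for the question's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def undersample(indices, labels, rng):
    """Cut every class in `indices` down to the size of the rarest class.

    Hypothetical helper for this sketch, not a scikit-learn function.
    """
    classes, counts = np.unique(labels[indices], return_counts=True)
    n_keep = counts.min()
    kept = [
        rng.choice(indices[labels[indices] == c], size=n_keep, replace=False)
        for c in classes
    ]
    return np.concatenate(kept)

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2, 3], [20, 200, 2000, 20000])  # scaled-down imbalance
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_summaries = []
for train_idx, test_idx in skf.split(X, y):
    balanced_train = undersample(train_idx, y, rng)
    # A model would be fit on X[balanced_train], y[balanced_train]
    # and scored on the untouched X[test_idx], y[test_idx].
    fold_summaries.append((np.bincount(y[balanced_train]), np.bincount(y[test_idx])))

for train_counts, test_counts in fold_summaries:
    print("train:", train_counts, "test:", test_counts)
```

The training counts are equal across classes in every fold, while the test counts keep the original imbalance. Undersampling throws away most of the majority-class data, so class weights may be the better choice when the estimator supports them.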
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
See k-fold cross validation. Without stratification, it just splits your data into k folds. Then, each fold 1 <= i <= k is used once as the test set, while the others are used for training. The results are averaged in the end. It's similar to running ShuffleSplit k times, except that the k test sets never overlap, so every sample is tested exactly once.
Stratification will ensure that the percentages of each class within each individual fold are the same as (or very close to) the percentages in your entire data.
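The practical difference between the two splitters can be checked by counting how often each sample ends up in a test set. This is a small sketch on assumed toy labels (90/10); the helper name `count_test_appearances` is made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

y = np.array([0] * 90 + [1] * 10)  # toy 90/10 labels
X = np.zeros((len(y), 1))

def count_test_appearances(splitter):
    """Count how many times each sample lands in a test set across all splits."""
    counts = np.zeros(len(y), dtype=int)
    for _, test_idx in splitter.split(X, y):
        counts[test_idx] += 1
    return counts

kfold_counts = count_test_appearances(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
# test_size=20 is an absolute number of samples (20 of the 100).
shuffle_counts = count_test_appearances(
    StratifiedShuffleSplit(n_splits=5, test_size=20, random_state=0)
)

print("StratifiedKFold:        min/max times tested =", kfold_counts.min(), kfold_counts.max())
print("StratifiedShuffleSplit: min/max times tested =", shuffle_counts.min(), shuffle_counts.max())
```

With StratifiedKFold every sample is tested exactly once, because the folds partition the data. StratifiedShuffleSplit draws each split independently, so a sample may be tested several times or not at all.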
There is a lot of literature that deals with imbalanced classes. Some simple-to-use methods involve using class weights and analyzing the ROC curve. I suggest the following resources for starting points on this:
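For the class-weight idea, a minimal sketch of scikit-learn's "balanced" heuristic is shown here, assuming a scaled-down version of the four class sizes from the question. It compares `compute_class_weight` against the formula n_samples / (n_classes * count) computed by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Scaled-down stand-in for the four class sizes in the question.
y = np.repeat([0, 1, 2, 3], [10, 100, 1000, 10000])
classes = np.unique(y)

# "balanced" weights each class inversely to its frequency.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

for c, w in zip(classes, weights):
    print("class", c, "weight", round(float(w), 4))
```

Many scikit-learn estimators accept `class_weight="balanced"` directly, which applies the same weighting during fitting without changing the data.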
K-Fold CV works by randomly partitioning your data into k (fairly) equal partitions. If your data were evenly balanced across classes like [0,1,0,1,0,1,0,1,0,1], randomly sampling (with or without replacement) will give you approximately equal sample sizes of 0 and 1.
However, if your data is more like 
[0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0]
where one class is heavily overrepresented, k-fold CV without weighted sampling can give you misleading results.
If you use ordinary k-fold CV without adjusting sampling weights from uniform sampling, then you'd obtain something like
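## k-fold CV on the labels above (sorted here, as in the output below)
import numpy as np
y = np.sort([0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0])
k = 5
splits = np.array_split(y, k)  # k contiguous, near-equal chunks
for i in range(k):
    print(repr(splits[i]))
array([0, 0, 0, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0])
array([0, 1, 1, 1, 1, 1])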
where there are clearly splits without useful representation of both classes.
The point of k-fold CV is to train/test a model across all subsets of data, while at each trial leaving out 1 subset and training on k-1 subsets.
In this scenario, you'd want to split by strata. In the above data set, there are 27 0s and 5 1s. If you'd like to compute k=5 CV, it wouldn't be reasonable to split the stratum of 1s into 5 subsets. A better solution is to split it into k < 5 subsets, such as 2. The stratum of 0s can remain with k=5 splits since it's much larger. Then while training, you'd have a simple product of 2 x 5 from the data set. Here is some code to illustrate:
import numpy as np
from itertools import groupby, product
# groupby only merges consecutive equal labels, so sort y (the labels above) first
for strata, iterable in groupby(np.sort(y)):
    data = np.array(list(iterable))
    if strata == 0:
        zeros = np.array_split(data, 5)
    else:
        ones = np.array_split(data, 2)
cv_splits = list(product(zeros, ones))  # the 5 x 2 = 10 held-out pairings
for i in range(2):
    for j in range(5):
        # train on every split except the two that are held out for testing
        train = np.concatenate([s for a, s in enumerate(ones) if a != i]
                               + [s for b, s in enumerate(zeros) if b != j])
        print("Leave out ONES split {}, and Leave out ZEROS split {}".format(i,j))
        print("train on: ", train)
        print("test on: ", np.concatenate((ones[i], zeros[j])))
Leave out ONES split 0, and Leave out ZEROS split 0
train on:  [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test on:  [1 1 1 0 0 0 0 0 0]
Leave out ONES split 0, and Leave out ZEROS split 1
train on:  [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
...
Leave out ONES split 1, and Leave out ZEROS split 4
train on:  [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test on:  [1 1 0 0 0 0 0]
This method can accomplish splitting the data into partitions where all partitions are eventually left out for testing. It should be noted that not all statistical learning methods allow for weighting, so adjusting methods like CV is essential to account for sampling proportions.