How to get balanced sample of classes from an imbalanced dataset in sklearn?

Tags:

scikit-learn

I have a dataset with binary class labels. I want to extract samples with balanced classes from my dataset. The code I have written below gives me an imbalanced split.

sss = StratifiedShuffleSplit(train_size=5000, n_splits=1, test_size=50000, random_state=0)
for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(itemfreq(y_train))

As you can see, class 0 has 2438 samples and class 1 has 2562.

[[  0.00000000e+00   2.43800000e+03]
 [  1.00000000e+00   2.56200000e+03]]

How should I proceed to get 2500 samples of class 1 and 2500 of class 0 in my training set (and 25000 of each in the test set)?

asked Mar 07 '17 by Krishna Kalyan



1 Answer

As you didn't provide us with the dataset, I'm using mock data generated by means of make_blobs. It remains unclear from your question how many test samples there should be. I've defined test_samples = 50000 but you can change this value to fit your needs.

from sklearn import datasets

train_samples = 5000
test_samples = 50000
total_samples = train_samples + test_samples
X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)
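As a quick sanity check (my addition, not part of the original answer): `make_blobs` splits `n_samples` evenly across the centers, so the mock data itself is perfectly balanced before splitting. You can verify this with `numpy.bincount`:

```python
import numpy as np
from sklearn import datasets

train_samples = 5000
test_samples = 50000
total_samples = train_samples + test_samples

X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)

# make_blobs divides n_samples equally among the two centers,
# so each class should contain 27500 samples here
print(np.bincount(y))
```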

The following snippet splits data into train and test with balanced classes:

from sklearn.model_selection import StratifiedShuffleSplit    

sss = StratifiedShuffleSplit(train_size=train_samples, n_splits=1, 
                             test_size=test_samples, random_state=0)  

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Demo:

In [54]: from scipy import stats

In [55]: stats.itemfreq(y_train)
Out[55]: 
array([[   0, 2500],
       [   1, 2500]], dtype=int64)

In [56]: stats.itemfreq(y_test)
Out[56]: 
array([[    0, 25000],
       [    1, 25000]], dtype=int64)

EDIT

As @geompalik correctly pointed out, if your dataset is unbalanced StratifiedShuffleSplit won't yield balanced splits. In that case you might find this function useful:

import numpy as np

def stratified_split(y, train_ratio):
    """Split indices of each class according to train_ratio."""
    def split_class(y, label, train_ratio):
        # indices of all samples belonging to this class
        indices = np.flatnonzero(y == label)
        n_train = int(indices.size * train_ratio)
        train_index = indices[:n_train]
        test_index = indices[n_train:]
        return (train_index, test_index)

    idx = [split_class(y, label, train_ratio) for label in np.unique(y)]
    train_index = np.concatenate([train for train, _ in idx])
    test_index = np.concatenate([test for _, test in idx])
    return train_index, test_index
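One caveat (my observation, not from the original answer): `stratified_split` takes the *first* indices of each class, so if your data is ordered the split is deterministic rather than random. A possible variant that shuffles within each class first, a sketch under that assumption:

```python
import numpy as np

def stratified_split_shuffled(y, train_ratio, seed=0):
    """Like stratified_split, but shuffles indices within each class."""
    rng = np.random.RandomState(seed)

    def split_class(label):
        indices = np.flatnonzero(y == label)
        rng.shuffle(indices)  # randomize before taking the split
        n_train = int(indices.size * train_ratio)
        return indices[:n_train], indices[n_train:]

    idx = [split_class(label) for label in np.unique(y)]
    train_index = np.concatenate([train for train, _ in idx])
    test_index = np.concatenate([test for _, test in idx])
    return train_index, test_index
```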

Demo:

I have previously generated mock data with the number of samples per class you indicated (code not shown here).

In [153]: y
Out[153]: array([1, 0, 1, ..., 0, 0, 1])

In [154]: y.size
Out[154]: 55000

In [155]: train_ratio = float(train_samples)/(train_samples + test_samples)  

In [156]: train_ratio
Out[156]: 0.09090909090909091

In [157]: train_index, test_index = stratified_split(y, train_ratio)

In [158]: y_train = y[train_index]

In [159]: y_test = y[test_index]

In [160]: y_train.size
Out[160]: 5000

In [161]: y_test.size
Out[161]: 50000

In [162]: stats.itemfreq(y_train)
Out[162]: 
array([[   0, 2438],
       [   1, 2562]], dtype=int64)

In [163]: stats.itemfreq(y_test)
Out[163]: 
array([[    0, 24380],
       [    1, 25620]], dtype=int64)
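Finally, if you need *exactly* equal counts per class (2500 + 2500 in train and 25000 + 25000 in test, as the question asks), one option is to draw a fixed number of indices per class without replacement. This is my sketch, not part of the original answer, and it assumes each class contains enough samples:

```python
import numpy as np

def balanced_split(y, n_train_per_class, n_test_per_class, seed=0):
    """Draw exactly n_train_per_class / n_test_per_class indices per class."""
    rng = np.random.RandomState(seed)
    train_index, test_index = [], []
    for label in np.unique(y):
        # shuffled indices of all samples belonging to this class
        indices = rng.permutation(np.flatnonzero(y == label))
        needed = n_train_per_class + n_test_per_class
        if indices.size < needed:
            raise ValueError(f"class {label} has only {indices.size} samples, "
                             f"but {needed} were requested")
        train_index.append(indices[:n_train_per_class])
        test_index.append(indices[n_train_per_class:needed])
    return np.concatenate(train_index), np.concatenate(test_index)
```

For example, `balanced_split(y, 2500, 25000)` would give a perfectly balanced train and test set, provided each class has at least 27500 samples; with fewer, you would have to reduce the per-class counts or oversample the minority class.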
answered Oct 26 '22 by Tonechas