I have a dataset with binary class labels. I want to extract a sample with balanced classes from my dataset. The code I have written below gives me an imbalanced result:
from sklearn.model_selection import StratifiedShuffleSplit
from scipy.stats import itemfreq

sss = StratifiedShuffleSplit(train_size=5000, n_splits=1, test_size=50000, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

print(itemfreq(y_train))
As you can see, class 0 has 2438 samples and class 1 has 2562:
[[ 0.00000000e+00 2.43800000e+03]
 [ 1.00000000e+00 2.56200000e+03]]
How should I proceed to get 2500 samples of class 0 and 2500 of class 1 in my training set (and 25000 of each in the test set)?
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique in which synthetic samples are generated for the minority class. It helps to overcome the overfitting problem posed by random oversampling.
In the case of cross-validation, we have two choices: 1) perform oversampling before executing cross-validation; 2) perform oversampling during cross-validation, i.e. within each fold, oversampling is applied to the training portion only, and this is repeated for every fold. The first option risks data leakage, since synthetic samples derived from validation points can end up in the training folds.
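A minimal sketch of the second (leakage-free) option, using plain sklearn: `StratifiedKFold` plus `sklearn.utils.resample` stands in for SMOTE here, and the classifier and data are placeholders, not part of the original question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Toy imbalanced data: ~90% class 0, ~10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Oversample the minority class of the *training fold only*,
    # so no information from the validation fold leaks in.
    minority = X_tr[y_tr == 1]
    n_majority = int((y_tr == 0).sum())
    X_up, y_up = resample(minority, np.ones(len(minority), dtype=int),
                          replace=True, n_samples=n_majority, random_state=0)
    X_bal = np.vstack([X_tr[y_tr == 0], X_up])
    y_bal = np.concatenate([np.zeros(n_majority, dtype=int), y_up])
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores.append(clf.score(X[val_idx], y[val_idx]))
```

Each fold trains on a balanced set but is evaluated on untouched, imbalanced validation data, which is what you want for an honest estimate.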
The sklearn.utils resample method can be used to tackle class imbalance. It can do both: undersample the majority-class records and oversample the minority-class records, as appropriate.
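For instance, undersampling the majority class with `sklearn.utils.resample` might look like this (the 900/100 toy split below is made up for illustration):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
y = np.array([0] * 900 + [1] * 100)   # imbalanced labels
X = rng.randn(1000, 2)                # toy features

X_maj, X_min = X[y == 0], X[y == 1]
# Downsample the majority class to the size of the minority class.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.array([0] * len(X_maj_down) + [1] * len(X_min))
```

Swapping the roles of the two classes and setting `replace=True` gives you oversampling of the minority class instead.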
There doesn't seem to be a method for doing balanced sampling in sklearn but it's kind of easy using basic numpy, for example a function like this might help you: Note that if you use this and sample more points per class than in the input data, then those will be upsampled (sample with replacement).
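Such a helper might look like the following (the name `balanced_sample_maker` is made up; it draws `n_per_class` indices per class, with replacement whenever a class has fewer points than requested):

```python
import numpy as np

def balanced_sample_maker(y, n_per_class, random_state=None):
    """Return indices containing exactly n_per_class samples of each class."""
    rng = np.random.RandomState(random_state)
    idx = []
    for label in np.unique(y):
        candidates = np.flatnonzero(y == label)
        # Upsample (sample with replacement) if the class is smaller than requested.
        replace = candidates.size < n_per_class
        idx.append(rng.choice(candidates, size=n_per_class, replace=replace))
    return np.concatenate(idx)

y = np.array([0] * 30 + [1] * 70)
sample_idx = balanced_sample_maker(y, n_per_class=50, random_state=0)
```

Here class 0 has only 30 points, so its 50 indices are drawn with replacement, while class 1 is subsampled without replacement.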
An imbalanced dataset is a dataset where the number of examples belonging to each class is not equal (here, class refers to the output in a classification problem).
In this tutorial, I deal with balancing. A balanced dataset is a dataset where each output class (or target class) is represented by the same number of input samples. Balancing can be performed by exploiting one of the following techniques: undersampling the majority class, oversampling the minority class, or generating synthetic minority samples (e.g. with SMOTE).
As you didn't provide us with the dataset, I'm using mock data generated by means of make_blobs. It remains unclear from your question how many test samples there should be; I've defined test_samples = 50000, but you can change this value to fit your needs.
from sklearn import datasets
train_samples = 5000
test_samples = 50000
total_samples = train_samples + test_samples
X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)
The following snippet splits data into train and test with balanced classes:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(train_size=train_samples, n_splits=1,
                             test_size=test_samples, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Demo:
In [54]: from scipy import stats
In [55]: stats.itemfreq(y_train)
Out[55]:
array([[ 0, 2500],
[ 1, 2500]], dtype=int64)
In [56]: stats.itemfreq(y_test)
Out[56]:
array([[ 0, 25000],
       [ 1, 25000]], dtype=int64)
(Note: scipy.stats.itemfreq has since been removed from SciPy; np.unique(y_test, return_counts=True) gives the same counts.)
EDIT
As @geompalik correctly pointed out, if your dataset is unbalanced StratifiedShuffleSplit
won't yield balanced splits. In that case you might find this function useful:
import numpy as np

def stratified_split(y, train_ratio):
    def split_class(y, label, train_ratio):
        # Indices of all samples belonging to this class.
        indices = np.flatnonzero(y == label)
        n_train = int(indices.size * train_ratio)
        train_index = indices[:n_train]
        test_index = indices[n_train:]
        return (train_index, test_index)

    idx = [split_class(y, label, train_ratio) for label in np.unique(y)]
    train_index = np.concatenate([train for train, _ in idx])
    test_index = np.concatenate([test for _, test in idx])
    return train_index, test_index
Demo:
I have previously generated mock data with the number of samples per class you indicated (code not shown here).
In [153]: y
Out[153]: array([1, 0, 1, ..., 0, 0, 1])
In [154]: y.size
Out[154]: 55000
In [155]: train_ratio = float(train_samples)/(train_samples + test_samples)
In [156]: train_ratio
Out[156]: 0.09090909090909091
In [157]: train_index, test_index = stratified_split(y, train_ratio)
In [158]: y_train = y[train_index]
In [159]: y_test = y[test_index]
In [160]: y_train.size
Out[160]: 5000
In [161]: y_test.size
Out[161]: 50000
In [162]: stats.itemfreq(y_train)
Out[162]:
array([[ 0, 2438],
[ 1, 2562]], dtype=int64)
In [163]: stats.itemfreq(y_test)
Out[163]:
array([[ 0, 24380],
[ 1, 25620]], dtype=int64)
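If you need *exactly* balanced splits (2500 per class in train, 25000 per class in test), a sketch along these lines would do it. The function name and the toy label counts are illustrative, and each class must contain at least n_train_per_class + n_test_per_class samples:

```python
import numpy as np

def balanced_split(y, n_train_per_class, n_test_per_class, random_state=None):
    """Draw exactly n_train_per_class and n_test_per_class indices per class."""
    rng = np.random.RandomState(random_state)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle this class's indices, then carve out the two disjoint slices.
        candidates = rng.permutation(np.flatnonzero(y == label))
        train_idx.append(candidates[:n_train_per_class])
        test_idx.append(candidates[n_train_per_class:
                                   n_train_per_class + n_test_per_class])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# Toy labels with enough samples of each class:
y = np.concatenate([np.zeros(28000, dtype=int), np.ones(28000, dtype=int)])
train_index, test_index = balanced_split(y, 2500, 25000, random_state=0)
```

Unlike a stratified split, which preserves the original class ratio, this simply takes a fixed number of samples per class, discarding whatever is left over.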